From d4bb88515a2240b9cb48a519f780f8217a611be2 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Sun, 24 Jan 2021 16:07:12 +0800
Subject: [PATCH] Add. Data cleaning
---
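Note for readers of this patch: the integrity check in the notebook compares row counts before and after an inner `pd.merge`. That comparison is only reliable when `user_id` is unique in the user table, because duplicate keys would inflate the merged row count. A minimal sketch of a more direct subset test using `Series.isin` might look like the following. The helper name `ids_are_subset` and the toy frames are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def ids_are_subset(actions: pd.DataFrame, users: pd.DataFrame, key: str = "user_id") -> bool:
    """Return True when every `key` value in `actions` also appears in `users`."""
    return bool(actions[key].isin(users[key]).all())


# Toy data (hypothetical): user 4 is missing from the user table.
users = pd.DataFrame({"user_id": [1, 2, 3]})
actions_ok = pd.DataFrame({"user_id": [1, 1, 3], "type": [1, 6, 4]})
actions_bad = pd.DataFrame({"user_id": [1, 4], "type": [1, 6]})

ok_result = ids_are_subset(actions_ok, users)
bad_result = ids_are_subset(actions_bad, users)
print(ok_result, bad_result)
```

Unlike the row-count comparison, this check does not depend on the user table having unique keys.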
.../数据清洗-checkpoint.ipynb | 1480 ++++++++++++++++-
.../数据清洗.ipynb | 1057 ++++++++++++
2 files changed, 2535 insertions(+), 2 deletions(-)
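The per-user counting in the notebook concatenates every chunk before grouping, so all rows still end up in memory at once. As a hedged sketch (not the notebook's method), the counts can instead be aggregated chunk by chunk and summed at the end. `count_actions`, `TYPE_NAMES`, and the in-memory CSV below are hypothetical names used only for illustration; zero denominators are turned into NaN before dividing so that the ratios never become infinite.

```python
import io

import numpy as np
import pandas as pd

# Behavior type code -> output column name (codes follow the notebook's table).
TYPE_NAMES = {1: "browse_num", 2: "addcart_num", 3: "delcart_num",
              4: "buy_num", 5: "favor_num", 6: "click_num"}


def count_actions(source, chunk_size=50000):
    """Count actions per user and type, holding only one chunk in memory at a time."""
    partial = []
    for chunk in pd.read_csv(source, usecols=["user_id", "type"], chunksize=chunk_size):
        # Small per-chunk summary: one row per (user_id, type) pair.
        partial.append(chunk.groupby(["user_id", "type"]).size())
    # Sum the per-chunk summaries, then pivot types into columns.
    total = pd.concat(partial).groupby(level=["user_id", "type"]).sum()
    table = total.unstack("type", fill_value=0)
    table = table.reindex(columns=list(TYPE_NAMES), fill_value=0).rename(columns=TYPE_NAMES)
    table.columns.name = None
    return table.reset_index()


# Tiny in-memory CSV standing in for one JData_Action file.
csv_text = "user_id,type\n1,1\n1,1\n1,4\n2,6\n2,1\n"
table = count_actions(io.StringIO(csv_text), chunk_size=2)

# Conversion rate capped at 1; a zero denominator yields NaN instead of inf.
table["buy_browse_ratio"] = (
    table["buy_num"] / table["browse_num"].replace(0, np.nan)
).clip(upper=1.0)
print(table)
```

Summing small per-chunk summaries keeps peak memory close to the size of one chunk plus the final user table.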
diff --git a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb
index 2fd6442..074d363 100644
--- a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb
@@ -1,6 +1,1482 @@
{
- "cells": [],
- "metadata": {},
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Task: JD.com User Purchase Intent Prediction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
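+ "### Background:\n",
+ "JD.com, China's largest direct-sales e-commerce company, has accumulated hundreds of millions of loyal users and a huge volume of real data while continuing to grow rapidly. Finding patterns in historical data to predict what users will buy next, so that the most suitable products reach the people who need them most, is a key problem in applying big data to precision marketing and a core technology for any e-commerce platform that is upgrading toward intelligent services.\n",
+ "\n",
+ "Based on real (anonymized) user, product, and behavior data from JD.com, use data mining techniques and machine learning algorithms to build a model that predicts user purchases, and output matches between high-potential users and target products to provide high-quality target groups for precision marketing.\n",
+ "\n",
+ "Goal: using historical sales data for products across several JD.com categories, build a model that predicts each user's intent to purchase products in a target category within the next 5 days."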
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Datasets:\n",
+ "The datasets used here are provided by JD.com:\n",
+ "\n",
+ "* JData_User.csv: user data, 105,321 users\n",
+ "* JData_Comment.csv: product comments, 558,552 records\n",
+ "* JData_Product.csv: candidate products for prediction, 24,187 records\n",
+ "* JData_Action_201602.csv: February behavior records, 11,485,424 records\n",
+ "* JData_Action_201603.csv: March behavior records, 25,916,378 records\n",
+ "* JData_Action_201604.csv: April behavior records, 13,199,934 records"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
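+ "**JData_User.csv: user data**\n",
+ "\n",
+ "|Field|Meaning|Notes|\n",
+ "|-|-|-|\n",
+ "|user_id|User ID|Anonymized|\n",
+ "|age|Age|-1 means unknown|\n",
+ "|sex|Sex|0 = male, 1 = female, 2 = unknown|\n",
+ "|user_lv_cd|User level|Enumerated; a higher level has a larger value|\n",
+ "|user_reg_tm|Registration date|Day granularity|"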
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
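+ "**JData_Comment.csv: comment data**\n",
+ "\n",
+ "|Field|Meaning|Notes|\n",
+ "|-|-|-|\n",
+ "|dt|Cutoff date|By day, up to 2016-02-01|\n",
+ "|sku_id|Product ID|Anonymized|\n",
+ "|comment_num|Cumulative comment count (bucketed)|0 = no comments, 1 = 1 comment, 2 = 2-10 comments, 3 = 11-50 comments, 4 = more than 50 comments|\n",
+ "|has_bad_comment|Has negative comments|0 = no, 1 = yes|\n",
+ "|bad_comment_rate|Negative comment rate|Negative comments as a fraction of all comments|"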
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
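+ "**JData_Product.csv: product data**\n",
+ "\n",
+ "|Field|Meaning|Notes|\n",
+ "|-|-|-|\n",
+ "|sku_id|Product ID|Anonymized|\n",
+ "|a1|Attribute 1|Enumerated, -1 means unknown|\n",
+ "|a2|Attribute 2|Enumerated, -1 means unknown|\n",
+ "|a3|Attribute 3|Enumerated, -1 means unknown|\n",
+ "|cate|Category ID|Anonymized|\n",
+ "|brand|Brand ID|Anonymized|"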
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
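+ "**JData_Action_xx.csv: behavior data**\n",
+ "\n",
+ "|Field|Meaning|Notes|\n",
+ "|-|-|-|\n",
+ "|user_id|User ID|Anonymized|\n",
+ "|sku_id|Product ID|Anonymized|\n",
+ "|time|Time of the action||\n",
+ "|model_id|ID of the clicked page module|Anonymized|\n",
+ "|type|Action type|1 = view product detail page; 2 = add to cart; 3 = remove from cart; 4 = place order; 5 = follow (favorite); 6 = click|\n",
+ "|cate|Category ID|Anonymized|\n",
+ "|brand|Brand ID|Anonymized|"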
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
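+ "### Data mining workflow:\n",
+ "(1) Data cleaning\n",
+ "1. Verify dataset integrity\n",
+ "2. Check the datasets for missing values\n",
+ "3. Decide how the values of each feature should be handled\n",
+ "4. Decide which data to keep and which can be filtered out\n",
+ "5. Build a new data source from the valuable information\n",
+ "6. Remove products and users with no interactions\n",
+ "7. Remove users with very many views but very few purchases (inactive buyers or crawlers)\n",
+ "\n",
+ "(2) Data understanding and analysis\n",
+ "1. Understand the meaning of each feature\n",
+ "2. Look for characteristics of the data that can be exploited for modeling\n",
+ "3. Visualize the data to aid analysis\n",
+ "4. Check whether purchase intent changes with time and other factors\n",
+ "\n",
+ "(3) Feature extraction\n",
+ "1. Decide which features of the cleaned dataset are valuable\n",
+ "2. Extract features for users, for products, and for the interactions between them\n",
+ "3. Which behavioral factors are core, and how should they be extracted?\n",
+ "4. Instantaneous or cumulative behavior features?\n",
+ "\n",
+ "(4) Model building\n",
+ "1. Predict with machine learning algorithms\n",
+ "2. Set and tune parameters\n",
+ "3. Split the dataset"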
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
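+ "### Dataset integrity verification\n",
+ "First check whether the users in JData_Action are consistent with the users in JData_User, so that every recorded action was produced by a user listed in the user data.\n",
+ "\n",
+ "Approach: use pd.merge to inner-join the user IDs from User with the Action records and check whether the number of Action rows shrinks. Example (the toy frames use a column named sku as the join key):"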
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " sku data\n",
+ "0 a 1\n",
+ "1 a 1\n",
+ "2 c 3\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Quick test of the approach\n",
+ "import pandas as pd\n",
+ "df1 = pd.DataFrame({'sku':['a','a','e','c'], 'data':[1,1,2,3]})\n",
+ "df2 = pd.DataFrame({'sku':['a','b','c']})\n",
+ "print(pd.merge(df1,df2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Only the rows whose key appears in both frames are printed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Is action of Feb. from User file? True\n",
+ "Is action of Mar. from User file? True\n",
+ "Is action of Apr. from User file? True\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Dataset verification\n",
+ "def user_action_check():\n",
+ "    df_user = pd.read_csv('data/JData_User.csv',encoding='gbk')\n",
+ "    df_sku = df_user.loc[:,'user_id'].to_frame()\n",
+ "    df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n",
+ "    # pd.merge(df_sku, df_month2) inner-joins on user_id (intersection, not union); an unchanged row count shows that every user_id in the action data is also in df_user\n",
+ "    print ('Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2)))\n",
+ "    df_month3 = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n",
+ "    print ('Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3)))\n",
+ "    df_month4 = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n",
+ "    print ('Is action of Apr. from User file? ', len(df_month4) == len(pd.merge(df_sku,df_month4)))\n",
+ "\n",
+ "user_action_check()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
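+ "Conclusion: every user in the action datasets also appears in the User dataset.\n",
+ "\n",
+ "Because the row counts before and after the merge are equal, the user IDs in Action are a subset of the IDs in User."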
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
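+ "### Check for duplicate records\n",
+ "Remove fully duplicated records from each data file. Note that duplicates may be meaningful: for example, a user may buy several items at once, or add several units of a product to the cart at the same time."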
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Duplicate records\n",
+ "def deduplicate(filepath, filename, newpath):\n",
+ "    df_file = pd.read_csv(filepath,encoding='gbk')\n",
+ "    before = df_file.shape[0]\n",
+ "    df_file.drop_duplicates(inplace=True) # rows identical in every column count as duplicates; inplace=True drops them from the original DataFrame\n",
+ "    after = df_file.shape[0]\n",
+ "    n_dup = before-after # difference in row count before and after\n",
+ "    print ('Number of duplicate records for ' + filename + ' is: ' + str(n_dup))\n",
+ "    if n_dup != 0:\n",
+ "        df_file.to_csv(newpath, index=None)\n",
+ "    else:\n",
+ "        print ('No duplicate records in ' + filename)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Number of duplicate records for Feb. action is: 2756093\n",
+ "Number of duplicate records for Mar. action is: 7085038\n",
+ "Number of duplicate records for Apr. action is: 3672710\n",
+ "Number of duplicate records for Comment is: 0\n",
+ "No duplicate records in Comment\n",
+ "Number of duplicate records for Product is: 0\n",
+ "No duplicate records in Product\n",
+ "Number of duplicate records for User is: 0\n",
+ "No duplicate records in User\n"
+ ]
+ }
+ ],
+ "source": [
+ "deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')\n",
+ "deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')\n",
+ "deduplicate('data/JData_Action_201604.csv', 'Apr. action', 'data/JData_Action_201604_dedup.csv')\n",
+ "deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')\n",
+ "deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')\n",
+ "deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id sku_id time model_id cate brand\n",
+ "type \n",
+ "1 2176378 2176378 2176378 0 2176378 2176378\n",
+ "2 636 636 636 0 636 636\n",
+ "3 1464 1464 1464 0 1464 1464\n",
+ "4 37 37 37 0 37 37\n",
+ "5 1981 1981 1981 0 1981 1981\n",
+ "6 575597 575597 575597 545054 575597 575597"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Inspect the duplicated records\n",
+ "df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n",
+ "IsDuplicated = df_month2.duplicated()\n",
+ "df_d = df_month2[IsDuplicated]\n",
+ "df_d.groupby('type').count() # most duplicates come from browsing (1) or clicking (6)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
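+ "### Check for users registered on or after April 15, 2016\n",
+ "The data covers user behavior up to April 15, so it should not include users who registered after that date."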
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id age sex user_lv_cd user_reg_tm\n",
+ "7457 207458 -1 2.0 1 2016-04-15\n",
+ "7463 207464 26-35岁 2.0 2 2016-04-15\n",
+ "7467 207468 36-45岁 2.0 3 2016-04-15\n",
+ "7472 207473 -1 2.0 1 2016-04-15\n",
+ "7482 207483 26-35岁 2.0 3 2016-04-15\n",
+ "7492 207493 16-25岁 2.0 3 2016-04-15\n",
+ "7493 207494 16-25岁 2.0 3 2016-04-15\n",
+ "7503 207504 16-25岁 2.0 4 2016-04-15\n",
+ "7510 207511 46-55岁 2.0 5 2016-04-15\n",
+ "7512 207513 -1 2.0 1 2016-04-15\n",
+ "7518 207519 26-35岁 2.0 2 2016-04-15\n",
+ "7521 207522 26-35岁 0.0 3 2016-04-15\n",
+ "7525 207526 -1 2.0 3 2016-04-15\n",
+ "7533 207534 -1 2.0 1 2016-04-15\n",
+ "7543 207544 26-35岁 2.0 3 2016-04-15\n",
+ "7544 207545 -1 2.0 1 2016-04-15\n",
+ "7551 207552 26-35岁 2.0 3 2016-04-15\n",
+ "7553 207554 16-25岁 2.0 4 2016-04-15\n",
+ "8545 208546 16-25岁 0.0 2 2016-04-29\n",
+ "9394 209395 16-25岁 1.0 2 2016-05-11\n",
+ "10362 210363 56岁以上 2.0 2 2016-05-24\n",
+ "10367 210368 -1 2.0 1 2016-05-24\n",
+ "11019 211020 36-45岁 2.0 3 2016-06-06\n",
+ "12014 212015 36-45岁 2.0 2 2016-07-05\n",
+ "13850 213851 26-35岁 2.0 3 2016-09-11\n",
+ "14542 214543 -1 2.0 1 2016-10-05\n",
+ "16746 216747 16-25岁 2.0 1 2016-11-25"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Check for users whose user_reg_tm >= '2016-4-15'\n",
+ "df_user = pd.read_csv('./data/JData_User.csv',encoding='gbk')\n",
+ "df_user['user_reg_tm']=pd.to_datetime(df_user['user_reg_tm']) \n",
+ "df_user.loc[df_user.user_reg_tm>= '2016-4-15']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
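+ "The check shows that some users did register on or after April 15. They only need to be removed if the action data contains records after April 15, so that is checked next."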
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Empty DataFrame\n",
+ "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n",
+ "Index: []"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_month = pd.read_csv('data/JData_Action_201604.csv')\n",
+ "df_month['time'] = pd.to_datetime(df_month['time'])\n",
+ "df_month.loc[df_month.time >= '2016-4-16']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
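+ "No action records fall after April 15, so these users have no interaction data outside the window and do not need to be removed."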
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### user_id in the action data is a float; convert it to int"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "int64\n",
+ "int64\n",
+ "int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "df_month = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n",
+ "df_month['user_id'] = df_month['user_id'].astype('int64')\n",
+ "print (df_month['user_id'].dtype)\n",
+ "df_month.to_csv('data/JData_Action_201602.csv',index=None)\n",
+ "\n",
+ "df_month = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n",
+ "df_month['user_id'] = df_month['user_id'].astype('int64')\n",
+ "print (df_month['user_id'].dtype)\n",
+ "df_month.to_csv('data/JData_Action_201603.csv',index=None)\n",
+ "\n",
+ "df_month = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n",
+ "df_month['user_id'] = df_month['user_id'].astype('int64')\n",
+ "print (df_month['user_id'].dtype)\n",
+ "df_month.to_csv('data/JData_Action_201604.csv',index=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Handling age ranges\n",
+ "Inspect the distribution of user ages and encode the age ranges as ordinal values"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " 3.0 46570\n",
+ " 4.0 30336\n",
+ "-1.0 14412\n",
+ " 2.0 8797\n",
+ " 5.0 3325\n",
+ " 6.0 1871\n",
+ " 1.0 7\n",
+ "Name: age, dtype: int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "age_mapping = { \n",
+ " '15岁以下': 1, \n",
+ " '16-25岁': 2, \n",
+ " '26-35岁': 3,\n",
+ " '36-45岁': 4,\n",
+ " '46-55岁': 5,\n",
+ " '56岁以上': 6,\n",
+ " '-1' :-1\n",
+ " } \n",
+ "df_user['age'] = df_user['age'].map(age_mapping)\n",
+ "print(df_user.age.value_counts())\n",
+ "df_user.to_csv('data/JData_User.csv',index=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To support the cleaning steps listed above, we first build simple behavior features for users and for products (items), stored in two tables: user_table and item_table."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### user_table features:\n",
+ "* user_id (user ID), age, sex,\n",
+ "* user_lv_cd (user level), browse_num (number of views),\n",
+ "* addcart_num (add-to-cart count), delcart_num (remove-from-cart count),\n",
+ "* buy_num (number of purchases), favor_num (number of favorites),\n",
+ "* click_num (number of clicks), buy_addcart_ratio (purchases per add-to-cart),\n",
+ "* buy_browse_ratio (purchases per view),\n",
+ "* buy_click_ratio (purchases per click),\n",
+ "* buy_favor_ratio (purchases per favorite)\n",
+ "\n",
+ "### item_table features:\n",
+ "* sku_id (product ID), attr1, attr2,\n",
+ "* attr3, cate, brand, browse_num,\n",
+ "* addcart_num, delcart_num,\n",
+ "* buy_num, favor_num, click_num,\n",
+ "* buy_addcart_ratio, buy_browse_ratio,\n",
+ "* buy_click_ratio, buy_favor_ratio,\n",
+ "* comment_num (number of comments),\n",
+ "* has_bad_comment (whether there are negative comments),\n",
+ "* bad_comment_rate (negative comment rate)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Building user_table"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# File names\n",
+ "ACTION_201602_FILE = \"data/JData_Action_201602.csv\" # ~11M rows\n",
+ "ACTION_201603_FILE = \"data/JData_Action_201603.csv\" # ~26M rows\n",
+ "ACTION_201604_FILE = \"data/JData_Action_201604.csv\" # ~13M rows\n",
+ "COMMENT_FILE = \"data/JData_Comment.csv\" # ~560K rows\n",
+ "PRODUCT_FILE = \"data/JData_Product.csv\" # ~24K rows\n",
+ "USER_FILE = \"data/JData_User.csv\" # ~105K rows\n",
+ "\n",
+ "USER_TABLE_FILE = \"data/user_table.csv\"\n",
+ "ITEM_TABLE_FILE = \"data/item_table.csv\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from collections import Counter"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Helper: compute per-type action counts for one user's group\n",
+ "def add_type_count(group):\n",
+ "    behavior_type = group.type.astype(int)\n",
+ "    # count how often each action type occurs\n",
+ "    type_cnt = Counter(behavior_type)\n",
+ "    # 1: browse 2: add to cart 3: remove from cart\n",
+ "    # 4: purchase 5: favorite 6: click\n",
+ "    group['browse_num'] = type_cnt[1]\n",
+ "    group['addcart_num'] = type_cnt[2]\n",
+ "    group['delcart_num'] = type_cnt[3]\n",
+ "    group['buy_num'] = type_cnt[4]\n",
+ "    group['favor_num'] = type_cnt[5]\n",
+ "    group['click_num'] = type_cnt[6]\n",
+ "\n",
+ "    return group[['user_id', 'browse_num', 'addcart_num',\n",
+ "                  'delcart_num', 'buy_num', 'favor_num',\n",
+ "                  'click_num']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The action data is large, and reading it all at once may raise a MemoryError, so it is read in chunks with pandas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Compute statistics from the action data\n",
+ "# Adjust chunk_size to suit the available memory\n",
+ "def get_from_action_data(fname, chunk_size=50000):\n",
+ "    reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')\n",
+ "    chunks = []\n",
+ "    loop = True\n",
+ "    while loop:\n",
+ "        try:\n",
+ "            # keep only the user_id and type columns\n",
+ "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n",
+ "            chunks.append(chunk)\n",
+ "        except StopIteration:\n",
+ "            loop = False\n",
+ "            print(\"Iteration is stopped\")\n",
+ "    # concatenate the chunks into a single pandas DataFrame\n",
+ "    df_ac = pd.concat(chunks, ignore_index=True)\n",
+ "    # group by user_id and compute the counts for each group; as_index=False returns the result without user_id as the index\n",
+ "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n",
+ "    # drop repeated rows so that each user appears once\n",
+ "    df_ac = df_ac.drop_duplicates('user_id')\n",
+ "\n",
+ "    return df_ac"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Aggregate the statistics from each action file\n",
+ "def merge_action_data():\n",
+ "    df_ac = []\n",
+ "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n",
+ "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n",
+ "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n",
+ "\n",
+ "    df_ac = pd.concat(df_ac, ignore_index=True)\n",
+ "    # sum each user's counts across the monthly action tables\n",
+ "    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n",
+ "    # build the conversion-rate columns (a zero denominator yields NaN or inf)\n",
+ "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n",
+ "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n",
+ "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n",
+ "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n",
+ "\n",
+ "    # cap conversion rates greater than 1 at 1 (100%)\n",
+ "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n",
+ "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n",
+ "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n",
+ "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n",
+ "\n",
+ "    return df_ac"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Extract the needed columns from the JData_User table\n",
+ "def get_from_jdata_user():\n",
+ "    df_usr = pd.read_csv(USER_FILE, header=0,encoding='gbk')\n",
+ "    df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n",
+ "    return df_usr"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id age sex user_lv_cd\n",
+ "0 200001 6.0 2.0 5\n",
+ "1 200002 -1.0 0.0 1\n",
+ "2 200003 4.0 1.0 4\n",
+ "3 200004 -1.0 2.0 1\n",
+ "4 200005 2.0 0.0 4"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_base = get_from_jdata_user()\n",
+ "user_base.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Iteration is stopped\n",
+ "Iteration is stopped\n",
+ "Iteration is stopped\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " user_id browse_num addcart_num delcart_num buy_num favor_num \\\n",
+ "0 200001 212 22 13 1 0 \n",
+ "1 200002 238 1 0 0 0 \n",
+ "2 200003 221 4 1 0 1 \n",
+ "3 200004 52 0 0 0 0 \n",
+ "4 200005 106 2 3 1 2 \n",
+ "\n",
+ " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n",
+ "0 414 0.045455 0.004717 0.002415 \n",
+ "1 484 0.000000 0.000000 0.000000 \n",
+ "2 420 0.000000 0.000000 0.000000 \n",
+ "3 61 NaN 0.000000 0.000000 \n",
+ "4 161 0.500000 0.009434 0.006211 \n",
+ "\n",
+ " buy_favor_ratio \n",
+ "0 1.0 \n",
+ "1 NaN \n",
+ "2 0.0 \n",
+ "3 NaN \n",
+ "4 0.5 "
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_behavior = merge_action_data()\n",
+ "user_behavior.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Join into a single table, like a SQL left join\n",
+ "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n",
+ "# Save as user_table.csv\n",
+ "user_behavior.to_csv(USER_TABLE_FILE, index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n",
+ "0 200001 6.0 2.0 5 212.0 22.0 13.0 \n",
+ "1 200002 -1.0 0.0 1 238.0 1.0 0.0 \n",
+ "2 200003 4.0 1.0 4 221.0 4.0 1.0 \n",
+ "3 200004 -1.0 2.0 1 52.0 0.0 0.0 \n",
+ "4 200005 2.0 0.0 4 106.0 2.0 3.0 \n",
+ "\n",
+ " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n",
+ "0 1.0 0.0 414.0 0.045455 0.004717 \n",
+ "1 0.0 0.0 484.0 0.000000 0.000000 \n",
+ "2 0.0 1.0 420.0 0.000000 0.000000 \n",
+ "3 0.0 0.0 61.0 NaN 0.000000 \n",
+ "4 1.0 2.0 161.0 0.500000 0.009434 \n",
+ "\n",
+ " buy_click_ratio buy_favor_ratio \n",
+ "0 0.002415 1.0 \n",
+ "1 0.000000 NaN \n",
+ "2 0.000000 0.0 \n",
+ "3 0.000000 NaN \n",
+ "4 0.006211 0.5 "
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_table = pd.read_csv(USER_TABLE_FILE)\n",
+ "user_table.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ }
+ },
"nbformat": 4,
"nbformat_minor": 2
}
diff --git a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb
index 3d4b56e..074d363 100644
--- a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb
@@ -393,6 +393,1063 @@
"df_d.groupby('type').count() # 发现重复数据大多数都是由于浏览(1),或者点击(6)产生"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
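+ "### Check for users registered on or after April 15, 2016\n",
+ "The data covers user behavior up to April 15, so it should not include users who registered after that date."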
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id age sex user_lv_cd user_reg_tm\n",
+ "7457 207458 -1 2.0 1 2016-04-15\n",
+ "7463 207464 26-35岁 2.0 2 2016-04-15\n",
+ "7467 207468 36-45岁 2.0 3 2016-04-15\n",
+ "7472 207473 -1 2.0 1 2016-04-15\n",
+ "7482 207483 26-35岁 2.0 3 2016-04-15\n",
+ "7492 207493 16-25岁 2.0 3 2016-04-15\n",
+ "7493 207494 16-25岁 2.0 3 2016-04-15\n",
+ "7503 207504 16-25岁 2.0 4 2016-04-15\n",
+ "7510 207511 46-55岁 2.0 5 2016-04-15\n",
+ "7512 207513 -1 2.0 1 2016-04-15\n",
+ "7518 207519 26-35岁 2.0 2 2016-04-15\n",
+ "7521 207522 26-35岁 0.0 3 2016-04-15\n",
+ "7525 207526 -1 2.0 3 2016-04-15\n",
+ "7533 207534 -1 2.0 1 2016-04-15\n",
+ "7543 207544 26-35岁 2.0 3 2016-04-15\n",
+ "7544 207545 -1 2.0 1 2016-04-15\n",
+ "7551 207552 26-35岁 2.0 3 2016-04-15\n",
+ "7553 207554 16-25岁 2.0 4 2016-04-15\n",
+ "8545 208546 16-25岁 0.0 2 2016-04-29\n",
+ "9394 209395 16-25岁 1.0 2 2016-05-11\n",
+ "10362 210363 56岁以上 2.0 2 2016-05-24\n",
+ "10367 210368 -1 2.0 1 2016-05-24\n",
+ "11019 211020 36-45岁 2.0 3 2016-06-06\n",
+ "12014 212015 36-45岁 2.0 2 2016-07-05\n",
+ "13850 213851 26-35岁 2.0 3 2016-09-11\n",
+ "14542 214543 -1 2.0 1 2016-10-05\n",
+ "16746 216747 16-25岁 2.0 1 2016-11-25"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Check for users whose user_reg_tm >= '2016-4-15'\n",
+ "df_user = pd.read_csv('./data/JData_User.csv', encoding='gbk')\n",
+ "df_user['user_reg_tm'] = pd.to_datetime(df_user['user_reg_tm'])\n",
+ "df_user.loc[df_user.user_reg_tm >= '2016-4-15']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
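+ "The check shows that users registered on or after April 15 do exist. If these users had behavior records dated after April 15, those records would fall outside the observation window and the users would have to be removed, so the action data is checked next."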
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>sku_id</th><th>time</th><th>model_id</th><th>type</th><th>cate</th><th>brand</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "Empty DataFrame\n",
+ "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n",
+ "Index: []"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_month = pd.read_csv('data/JData_Action_201604.csv')\n",
+ "df_month['time'] = pd.to_datetime(df_month['time'])\n",
+ "df_month.loc[df_month.time >= '2016-4-16']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
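+ "The action data contains no records after April 15, so these users have no interactions outside the window and do not need to be removed."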
+ ]
+ },
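+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of a complementary check, run on toy data rather than the JData files: instead of only inspecting action timestamps, it tests directly whether any user registered on or after the cutoff appears in the action log. The frames `users` and `actions` and the variable `cutoff` are illustrative assumptions and are not part of the original notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Toy stand-ins for JData_User.csv and one action file (hypothetical values)\n",
+ "users = pd.DataFrame({\n",
+ "    'user_id': [1, 2, 3],\n",
+ "    'user_reg_tm': pd.to_datetime(['2016-01-10', '2016-04-15', '2016-05-24']),\n",
+ "})\n",
+ "actions = pd.DataFrame({\n",
+ "    'user_id': [1, 1, 2],\n",
+ "    'time': pd.to_datetime(['2016-04-01', '2016-04-10', '2016-04-15']),\n",
+ "})\n",
+ "\n",
+ "cutoff = pd.Timestamp('2016-04-15')\n",
+ "late_ids = users.loc[users['user_reg_tm'] >= cutoff, 'user_id']\n",
+ "# Actions that belong to users registered on or after the cutoff\n",
+ "late_actions = actions[actions['user_id'].isin(late_ids)]\n",
+ "# Late-registered users with no actions at all\n",
+ "silent_ids = sorted(set(late_ids) - set(actions['user_id']))\n",
+ "print(len(late_actions), silent_ids)"
+ ]
+ },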
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### user_id in the action data is a float; convert it to int"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "int64\n",
+ "int64\n",
+ "int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Convert user_id from float to int in each monthly action file and write the file back\n",
+ "for fname in ['data/JData_Action_201602.csv',\n",
+ "              'data/JData_Action_201603.csv',\n",
+ "              'data/JData_Action_201604.csv']:\n",
+ "    df_month = pd.read_csv(fname, encoding='gbk')\n",
+ "    df_month['user_id'] = df_month['user_id'].astype('int64')\n",
+ "    print(df_month['user_id'].dtype)\n",
+ "    df_month.to_csv(fname, index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Handling age ranges\n",
+ "Inspect the distribution of user ages and encode the age ranges as a feature."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " 3.0 46570\n",
+ " 4.0 30336\n",
+ "-1.0 14412\n",
+ " 2.0 8797\n",
+ " 5.0 3325\n",
+ " 6.0 1871\n",
+ " 1.0 7\n",
+ "Name: age, dtype: int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Map each age-range label to an ordinal code; '-1' marks an unknown age\n",
+ "age_mapping = {\n",
+ "    '15岁以下': 1,\n",
+ "    '16-25岁': 2,\n",
+ "    '26-35岁': 3,\n",
+ "    '36-45岁': 4,\n",
+ "    '46-55岁': 5,\n",
+ "    '56岁以上': 6,\n",
+ "    '-1': -1,\n",
+ "}\n",
+ "df_user['age'] = df_user['age'].map(age_mapping)\n",
+ "print(df_user.age.value_counts())\n",
+ "# Use a forward slash so the path also works outside Windows\n",
+ "df_user.to_csv('data/JData_User.csv', index=False)"
+ ]
+ },
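+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`Series.map` with a dict returns NaN for any value that is missing from the dict, which is a likely reason the counts above are indexed by floats such as `3.0`. The sketch below, on toy data, lists the values that fall through the mapping. The frame `toy`, the name `unmapped`, and the choice to fill gaps with -1 are illustrative assumptions, not steps taken in the original notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "age_mapping = {'15岁以下': 1, '16-25岁': 2, '26-35岁': 3,\n",
+ "               '36-45岁': 4, '46-55岁': 5, '56岁以上': 6, '-1': -1}\n",
+ "# Toy column with one missing value and one label the mapping does not know\n",
+ "toy = pd.DataFrame({'age': ['26-35岁', '-1', None, '16-25岁', 'unknown']})\n",
+ "encoded = toy['age'].map(age_mapping)\n",
+ "# Raw values that produced NaN because the mapping has no key for them\n",
+ "unmapped = toy.loc[encoded.isna(), 'age'].tolist()\n",
+ "# Assumption: treat the gaps as unknown (-1) so the column can be stored as int\n",
+ "encoded = encoded.fillna(-1).astype(int)\n",
+ "print(unmapped, encoded.tolist())"
+ ]
+ },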
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To support the cleaning described above, simple user behavior features and item behavior features are built first. They are stored in two tables, user_table and item_table."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### user_table features:\n",
+ "* user_id (user ID), age, sex,\n",
+ "* user_lv_cd (user level), browse_num (number of browses),\n",
+ "* addcart_num (number of add-to-cart actions), delcart_num (number of remove-from-cart actions),\n",
+ "* buy_num (number of purchases), favor_num (number of favorites),\n",
+ "* click_num (number of clicks), buy_addcart_ratio (purchase to add-to-cart conversion rate),\n",
+ "* buy_browse_ratio (purchase to browse conversion rate),\n",
+ "* buy_click_ratio (purchase to click conversion rate),\n",
+ "* buy_favor_ratio (purchase to favorite conversion rate)\n",
+ "\n",
+ "### item_table features:\n",
+ "* sku_id (item ID), attr1, attr2,\n",
+ "* attr3, cate, brand, browse_num,\n",
+ "* addcart_num, delcart_num,\n",
+ "* buy_num, favor_num, click_num,\n",
+ "* buy_addcart_ratio, buy_browse_ratio,\n",
+ "* buy_click_ratio, buy_favor_ratio,\n",
+ "* comment_num (number of comments),\n",
+ "* has_bad_comment (whether the item has bad comments),\n",
+ "* bad_comment_rate (bad comment rate)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Build user_table"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# File names\n",
+ "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"  # ~11M rows\n",
+ "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"  # ~26M rows\n",
+ "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"  # ~13M rows\n",
+ "COMMENT_FILE = \"data/JData_Comment.csv\"  # ~560K rows\n",
+ "PRODUCT_FILE = \"data/JData_Product.csv\"  # ~24K rows\n",
+ "USER_FILE = \"data/JData_User.csv\"  # ~105K rows\n",
+ "\n",
+ "USER_TABLE_FILE = \"data/user_table.csv\"\n",
+ "ITEM_TABLE_FILE = \"data/item_table.csv\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from collections import Counter"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Helper: count each behavior type within one user's group\n",
+ "def add_type_count(group):\n",
+ "    behavior_type = group.type.astype(int)\n",
+ "    # Count occurrences of each behavior type\n",
+ "    type_cnt = Counter(behavior_type)\n",
+ "    # 1: browse  2: add to cart  3: remove from cart\n",
+ "    # 4: purchase  5: favorite  6: click\n",
+ "    group['browse_num'] = type_cnt[1]\n",
+ "    group['addcart_num'] = type_cnt[2]\n",
+ "    group['delcart_num'] = type_cnt[3]\n",
+ "    group['buy_num'] = type_cnt[4]\n",
+ "    group['favor_num'] = type_cnt[5]\n",
+ "    group['click_num'] = type_cnt[6]\n",
+ "\n",
+ "    return group[['user_id', 'browse_num', 'addcart_num',\n",
+ "                  'delcart_num', 'buy_num', 'favor_num',\n",
+ "                  'click_num']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The user action data is large, and reading it in one pass may raise a MemoryError, so it is read in chunks with pandas."
+ ]
+ },
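+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The chunked reading described above can also be sketched with the `chunksize` and `usecols` parameters of `pd.read_csv`. The example reads a small in-memory CSV instead of the real action files; `csv_text` and the chunk size of 2 are illustrative assumptions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import io\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Toy action log with five rows (hypothetical values)\n",
+ "csv_text = '''user_id,sku_id,type\n",
+ "1,10,1\n",
+ "1,11,6\n",
+ "2,10,1\n",
+ "2,12,4\n",
+ "3,13,6\n",
+ "'''\n",
+ "# chunksize makes read_csv return an iterator of DataFrames;\n",
+ "# usecols limits parsing to the two columns that are needed\n",
+ "reader = pd.read_csv(io.StringIO(csv_text), usecols=['user_id', 'type'], chunksize=2)\n",
+ "chunks = [chunk for chunk in reader]\n",
+ "df_small = pd.concat(chunks, ignore_index=True)\n",
+ "print(len(chunks), df_small.shape)"
+ ]
+ },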
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Aggregate statistics from one action file\n",
+ "# Adjust chunk_size to fit the available memory\n",
+ "def get_from_action_data(fname, chunk_size=50000):\n",
+ "    reader = pd.read_csv(fname, header=0, iterator=True, encoding='gbk')\n",
+ "    chunks = []\n",
+ "    loop = True\n",
+ "    while loop:\n",
+ "        try:\n",
+ "            # Keep only the user_id and type columns\n",
+ "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n",
+ "            chunks.append(chunk)\n",
+ "        except StopIteration:\n",
+ "            loop = False\n",
+ "            print(\"Iteration is stopped\")\n",
+ "    # Concatenate the chunks into a single DataFrame\n",
+ "    df_ac = pd.concat(chunks, ignore_index=True)\n",
+ "    # Group by user_id and count per group; as_index=False keeps user_id as a column\n",
+ "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n",
+ "    # All rows of a user now carry the same counts, so keep one row per user\n",
+ "    df_ac = df_ac.drop_duplicates('user_id')\n",
+ "\n",
+ "    return df_ac"
+ ]
+ },
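+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an alternative sketch to the `groupby(...).apply(add_type_count)` step, `pd.crosstab` counts each behavior type per user in one call and yields one row per user without a separate de-duplication. This is a swapped-in technique shown on toy data, not the notebook's own method; `toy_actions` and `type_names` are illustrative assumptions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "toy_actions = pd.DataFrame({'user_id': [1, 1, 1, 2, 2],\n",
+ "                            'type': [1, 6, 6, 1, 4]})\n",
+ "type_names = {1: 'browse_num', 2: 'addcart_num', 3: 'delcart_num',\n",
+ "              4: 'buy_num', 5: 'favor_num', 6: 'click_num'}\n",
+ "# Rows are users, columns are behavior types, cells are counts\n",
+ "counts = pd.crosstab(toy_actions['user_id'], toy_actions['type'])\n",
+ "# Add zero columns for types absent from the data, then rename to feature names\n",
+ "counts = counts.reindex(columns=list(type_names), fill_value=0).rename(columns=type_names)\n",
+ "counts = counts.reset_index()\n",
+ "print(counts)"
+ ]
+ },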
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Combine the per-file action statistics\n",
+ "def merge_action_data():\n",
+ "    df_ac = []\n",
+ "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n",
+ "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n",
+ "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n",
+ "\n",
+ "    df_ac = pd.concat(df_ac, ignore_index=True)\n",
+ "    # Sum each user's counts across the three action files\n",
+ "    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n",
+ "    # Build the conversion-rate columns\n",
+ "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n",
+ "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n",
+ "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n",
+ "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n",
+ "\n",
+ "    # Cap conversion rates above 1 at 1 (100%)\n",
+ "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n",
+ "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n",
+ "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n",
+ "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n",
+ "\n",
+ "    return df_ac"
+ ]
+ },
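+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the ratio handling in `merge_action_data` concrete, the sketch below reproduces it on toy counts. Dividing by zero yields `inf` when `buy_num` is positive and `NaN` when both counts are zero; the `> 1` rule then caps `inf` at 1 and leaves `NaN` unchanged. The frame `toy` is an illustrative assumption."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Three toy users: a purchase without favorites, no activity, and a normal case\n",
+ "toy = pd.DataFrame({'buy_num': [1, 0, 1], 'favor_num': [0, 0, 2]})\n",
+ "ratio = toy['buy_num'] / toy['favor_num']  # 1/0 -> inf, 0/0 -> NaN, 1/2 -> 0.5\n",
+ "# Same rule as in merge_action_data: cap values above 1 at 1\n",
+ "ratio[ratio > 1.] = 1.\n",
+ "print(ratio.tolist())"
+ ]
+ },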
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Extract the required columns from the JData_User table\n",
+ "def get_from_jdata_user():\n",
+ "    df_usr = pd.read_csv(USER_FILE, header=0, encoding='gbk')\n",
+ "    df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n",
+ "    return df_usr"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>age</th><th>sex</th><th>user_lv_cd</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>200001</td><td>6.0</td><td>2.0</td><td>5</td></tr>\n",
+ " <tr><th>1</th><td>200002</td><td>-1.0</td><td>0.0</td><td>1</td></tr>\n",
+ " <tr><th>2</th><td>200003</td><td>4.0</td><td>1.0</td><td>4</td></tr>\n",
+ " <tr><th>3</th><td>200004</td><td>-1.0</td><td>2.0</td><td>1</td></tr>\n",
+ " <tr><th>4</th><td>200005</td><td>2.0</td><td>0.0</td><td>4</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " user_id age sex user_lv_cd\n",
+ "0 200001 6.0 2.0 5\n",
+ "1 200002 -1.0 0.0 1\n",
+ "2 200003 4.0 1.0 4\n",
+ "3 200004 -1.0 2.0 1\n",
+ "4 200005 2.0 0.0 4"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_base = get_from_jdata_user()\n",
+ "user_base.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Iteration is stopped\n",
+ "Iteration is stopped\n",
+ "Iteration is stopped\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>browse_num</th><th>addcart_num</th><th>delcart_num</th><th>buy_num</th><th>favor_num</th><th>click_num</th><th>buy_addcart_ratio</th><th>buy_browse_ratio</th><th>buy_click_ratio</th><th>buy_favor_ratio</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>200001</td><td>212</td><td>22</td><td>13</td><td>1</td><td>0</td><td>414</td><td>0.045455</td><td>0.004717</td><td>0.002415</td><td>1.0</td></tr>\n",
+ " <tr><th>1</th><td>200002</td><td>238</td><td>1</td><td>0</td><td>0</td><td>0</td><td>484</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>NaN</td></tr>\n",
+ " <tr><th>2</th><td>200003</td><td>221</td><td>4</td><td>1</td><td>0</td><td>1</td><td>420</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.0</td></tr>\n",
+ " <tr><th>3</th><td>200004</td><td>52</td><td>0</td><td>0</td><td>0</td><td>0</td><td>61</td><td>NaN</td><td>0.000000</td><td>0.000000</td><td>NaN</td></tr>\n",
+ " <tr><th>4</th><td>200005</td><td>106</td><td>2</td><td>3</td><td>1</td><td>2</td><td>161</td><td>0.500000</td><td>0.009434</td><td>0.006211</td><td>0.5</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " user_id browse_num addcart_num delcart_num buy_num favor_num \\\n",
+ "0 200001 212 22 13 1 0 \n",
+ "1 200002 238 1 0 0 0 \n",
+ "2 200003 221 4 1 0 1 \n",
+ "3 200004 52 0 0 0 0 \n",
+ "4 200005 106 2 3 1 2 \n",
+ "\n",
+ " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n",
+ "0 414 0.045455 0.004717 0.002415 \n",
+ "1 484 0.000000 0.000000 0.000000 \n",
+ "2 420 0.000000 0.000000 0.000000 \n",
+ "3 61 NaN 0.000000 0.000000 \n",
+ "4 161 0.500000 0.009434 0.006211 \n",
+ "\n",
+ " buy_favor_ratio \n",
+ "0 1.0 \n",
+ "1 NaN \n",
+ "2 0.0 \n",
+ "3 NaN \n",
+ "4 0.5 "
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_behavior = merge_action_data()\n",
+ "user_behavior.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Join into one table, like a SQL left join\n",
+ "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n",
+ "# Save as user_table.csv\n",
+ "user_behavior.to_csv(USER_TABLE_FILE, index=False)"
+ ]
+ },
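+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A small sketch of what the left join does, on toy frames: every row of the user table is kept, users with no recorded actions get `NaN`, and an integer count column that receives `NaN` is converted to float, which would explain values such as `212.0` in the saved table. The frames `base` and `behavior` are illustrative assumptions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "base = pd.DataFrame({'user_id': [1, 2, 3], 'user_lv_cd': [5, 1, 4]})\n",
+ "behavior = pd.DataFrame({'user_id': [1, 2], 'browse_num': [212, 238]})\n",
+ "merged = pd.merge(base, behavior, on=['user_id'], how='left')\n",
+ "# User 3 has no behavior row, so browse_num is NaN and the column becomes float\n",
+ "print(merged)\n",
+ "print(merged['browse_num'].dtype)"
+ ]
+ },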
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>age</th><th>sex</th><th>user_lv_cd</th><th>browse_num</th><th>addcart_num</th><th>delcart_num</th><th>buy_num</th><th>favor_num</th><th>click_num</th><th>buy_addcart_ratio</th><th>buy_browse_ratio</th><th>buy_click_ratio</th><th>buy_favor_ratio</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>200001</td><td>6.0</td><td>2.0</td><td>5</td><td>212.0</td><td>22.0</td><td>13.0</td><td>1.0</td><td>0.0</td><td>414.0</td><td>0.045455</td><td>0.004717</td><td>0.002415</td><td>1.0</td></tr>\n",
+ " <tr><th>1</th><td>200002</td><td>-1.0</td><td>0.0</td><td>1</td><td>238.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>484.0</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>NaN</td></tr>\n",
+ " <tr><th>2</th><td>200003</td><td>4.0</td><td>1.0</td><td>4</td><td>221.0</td><td>4.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>420.0</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.0</td></tr>\n",
+ " <tr><th>3</th><td>200004</td><td>-1.0</td><td>2.0</td><td>1</td><td>52.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>61.0</td><td>NaN</td><td>0.000000</td><td>0.000000</td><td>NaN</td></tr>\n",
+ " <tr><th>4</th><td>200005</td><td>2.0</td><td>0.0</td><td>4</td><td>106.0</td><td>2.0</td><td>3.0</td><td>1.0</td><td>2.0</td><td>161.0</td><td>0.500000</td><td>0.009434</td><td>0.006211</td><td>0.5</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n",
+ "0 200001 6.0 2.0 5 212.0 22.0 13.0 \n",
+ "1 200002 -1.0 0.0 1 238.0 1.0 0.0 \n",
+ "2 200003 4.0 1.0 4 221.0 4.0 1.0 \n",
+ "3 200004 -1.0 2.0 1 52.0 0.0 0.0 \n",
+ "4 200005 2.0 0.0 4 106.0 2.0 3.0 \n",
+ "\n",
+ " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n",
+ "0 1.0 0.0 414.0 0.045455 0.004717 \n",
+ "1 0.0 0.0 484.0 0.000000 0.000000 \n",
+ "2 0.0 1.0 420.0 0.000000 0.000000 \n",
+ "3 0.0 0.0 61.0 NaN 0.000000 \n",
+ "4 1.0 2.0 161.0 0.500000 0.009434 \n",
+ "\n",
+ " buy_click_ratio buy_favor_ratio \n",
+ "0 0.002415 1.0 \n",
+ "1 0.000000 NaN \n",
+ "2 0.000000 0.0 \n",
+ "3 0.000000 NaN \n",
+ "4 0.006211 0.5 "
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_table = pd.read_csv(USER_TABLE_FILE)\n",
+ "user_table.head()"
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,