From 2d6d13308e7e56a6d36f65c98dca5a4ae14dc2a0 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Sun, 14 Feb 2021 09:54:01 +0800
Subject: [PATCH] =?UTF-8?q?Delete=201-=E6=95=B0=E6=8D=AE=E6=B8=85=E6=B4=97?=
=?UTF-8?q?-checkpoint.ipynb?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
.../1-数据清洗-checkpoint.ipynb | 3347 -----------------
1 file changed, 3347 deletions(-)
delete mode 100644 机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb
diff --git a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb
deleted file mode 100644
index 82734f5..0000000
--- a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb
+++ /dev/null
@@ -1,3347 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Task: Predicting JD.com User Purchase Intent"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
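- "### Background:\n",
- "JD.com, China's largest direct-sales e-commerce company, has accumulated hundreds of millions of loyal users and a massive amount of real data while sustaining rapid growth. Finding patterns in historical data to predict users' future purchase needs, so that the most suitable products reach the people who need them most, is a key problem for big-data applications in precision marketing and a core technology for any e-commerce platform undergoing an intelligent upgrade.\n",
- "\n",
- "Based on real (anonymized) user, product, and behavior data from the JD.com marketplace, the task is to use data mining and machine learning to build a model that predicts which products users will buy, output matches between high-potential users and target products, and provide a high-quality target audience for precision marketing."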
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
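- "### Goal:\n",
- "Using historical sales data for products in several JD.com categories, build a model that predicts each user's purchase intent for products in a target category over the next 5 days."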
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Datasets:\n",
- "The datasets used here are from JD.com:\n",
- "\n",
- "* JData_User.csv: user data, 105,321 users\n",
- "* JData_Comment.csv: product comments, 558,552 records\n",
- "* JData_Product.csv: candidate products for prediction, 24,187 records\n",
- "* JData_Action_201602.csv: behavior records for February, 11,485,424 records\n",
- "* JData_Action_201603.csv: behavior records for March, 25,916,378 records\n",
- "* JData_Action_201604.csv: behavior records for April, 13,199,934 records"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
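- "**JData_User.csv: user data**\n",
- "\n",
- "|Field|Meaning|Notes|\n",
- "|-|-|-|\n",
- "|user_id|User ID|Anonymized|\n",
- "|age|Age bracket|-1 means unknown|\n",
- "|sex|Sex|0 = male, 1 = female, 2 = unknown|\n",
- "|user_lv_cd|User level|Enumerated levels; a larger value means a higher level|\n",
- "|user_reg_tm|Registration date|Day granularity|"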
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
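- "**JData_Comment.csv: comment data**\n",
- "\n",
- "|Field|Meaning|Notes|\n",
- "|-|-|-|\n",
- "|dt|Cutoff date|Day granularity; through 2016-02-01|\n",
- "|sku_id|Product ID|Anonymized|\n",
- "|comment_num|Cumulative comment count, bucketed|0 = no comments, 1 = 1 comment, 2 = 2-10 comments, 3 = 11-50 comments, 4 = more than 50 comments|\n",
- "|has_bad_comment|Whether there are bad comments|0 = no, 1 = yes|\n",
- "|bad_comment_rate|Bad-comment rate|Share of bad comments among all comments|"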
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
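- "**JData_Product.csv: product data**\n",
- "\n",
- "|Field|Meaning|Notes|\n",
- "|-|-|-|\n",
- "|sku_id|Product ID|Anonymized|\n",
- "|a1|Attribute 1|Enumeration; -1 means unknown|\n",
- "|a2|Attribute 2|Enumeration; -1 means unknown|\n",
- "|a3|Attribute 3|Enumeration; -1 means unknown|\n",
- "|cate|Category ID|Anonymized|\n",
- "|brand|Brand ID|Anonymized|"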
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
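- "**JData_Action_xx.csv: behavior data**\n",
- "\n",
- "|Field|Meaning|Notes|\n",
- "|-|-|-|\n",
- "|user_id|User ID|Anonymized|\n",
- "|sku_id|Product ID|Anonymized|\n",
- "|time|Time of the action||\n",
- "|model_id|ID of the page module that was clicked|Anonymized|\n",
- "|type|Action type|1 = view product detail page; 2 = add to cart; 3 = remove from cart; 4 = place order; 5 = favorite; 6 = click|\n",
- "|cate|Category ID|Anonymized|\n",
- "|brand|Brand ID|Anonymized|"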
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
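- "### Data mining workflow:\n",
- "(1) Data cleaning\n",
- "1. Verify dataset integrity\n",
- "2. Check whether the datasets contain missing values\n",
- "3. Decide how the values of each feature should be handled\n",
- "4. Decide which data we want and which can be filtered out\n",
- "5. Build new data sources from the valuable information\n",
- "6. Remove products and users with no interactions\n",
- "7. Remove users with very many views but very few purchases (inactive users or crawlers)\n",
- "\n",
- "(2) Data understanding and analysis\n",
- "1. Understand the meaning of each feature\n",
- "2. Look at the characteristics of the data and whether they can be used for modeling\n",
- "3. Visualize the data to aid analysis\n",
- "4. Check whether purchase intent changes with time and other factors\n",
- "\n",
- "(3) Feature extraction\n",
- "1. Decide which features of the cleaned dataset are valuable\n",
- "2. Extract features for users, for products, and for the interactions between them\n",
- "3. Which behavioral factors are central, and how should they be extracted?\n",
- "4. Instantaneous behavior features or cumulative behavior features?\n",
- "\n",
- "(4) Model building\n",
- "1. Use machine learning algorithms to make predictions\n",
- "2. Parameter setting and tuning\n",
- "3. Dataset splitting"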
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
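- "### Dataset integrity check\n",
- "First check that the users in JData_User and the users in JData_Action are consistent, to make sure every action in the behavior data was produced by a user present in the user data.\n",
- "\n",
- "Approach: use pd.merge to join the user_id column of User with the user_id column of Action, and check whether the number of Action rows shrinks. Example:"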
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- " sku data\n",
- "0 a 1\n",
- "1 a 1\n",
- "2 c 3\n"
- ]
- }
- ],
- "source": [
- "# Try out the method on toy data\n",
- "import pandas as pd\n",
- "df1 = pd.DataFrame({'sku':['a','a','e','c'], 'data':[1,1,2,3]})\n",
- "df2 = pd.DataFrame({'sku':['a','b','c']})\n",
- "print(pd.merge(df1,df2))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The result only contains rows whose key appears in both frames (an inner join)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Is action of Feb. from User file? True\n",
- "Is action of Mar. from User file? True\n",
- "Is action of Apr. from User file? True\n"
- ]
- }
- ],
- "source": [
- "# Dataset validation\n",
- "def user_action_check():\n",
- "    df_user = pd.read_csv('data/JData_User.csv',encoding='gbk')\n",
- "    df_sku = df_user.loc[:,'user_id'].to_frame()\n",
- "    df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n",
- "    # pd.merge(df_sku, df_month2) is an inner join on user_id (an intersection, not a union); an unchanged row count shows that every user_id in the action data also appears in df_user\n",
- "    print ('Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2)))\n",
- "    df_month3 = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n",
- "    print ('Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3)))\n",
- "    df_month4 = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n",
- "    print ('Is action of Apr. from User file? ', len(df_month4) == len(pd.merge(df_sku,df_month4)))\n",
- "\n",
- "user_action_check()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
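- "Conclusion: every user in the behavior data also appears in the User dataset.\n",
- "\n",
- "Because the row counts before and after the merge are equal, the user IDs in Action are guaranteed to be a subset of the IDs in User."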
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
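- "### Check for duplicate records\n",
- "Remove fully duplicated records from each data file. Note that duplicates may be meaningful: for example, a user may buy several items at once, or add several units of a product to the cart at once."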
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Duplicate records\n",
- "def deduplicate(filepath, filename, newpath):\n",
- "    df_file = pd.read_csv(filepath,encoding='gbk')\n",
- "    before = df_file.shape[0]\n",
- "    df_file.drop_duplicates(inplace=True) # rows identical in every column count as duplicates; inplace=True drops them from the original DataFrame\n",
- "    after = df_file.shape[0]\n",
- "    n_dup = before-after # difference in row count before and after\n",
- "    print ('Number of duplicate records for ' + filename + ' is: ' + str(n_dup))\n",
- "    if n_dup != 0:\n",
- "        df_file.to_csv(newpath, index=False)\n",
- "    else:\n",
- "        print ('No duplicate records in ' + filename)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of duplicate records for Feb. action is: 2756093\n",
- "Number of duplicate records for Mar. action is: 7085038\n",
- "Number of duplicate records for Apr. action is: 3672710\n",
- "Number of duplicate records for Comment is: 0\n",
- "No duplicate records in Comment\n",
- "Number of duplicate records for Product is: 0\n",
- "No duplicate records in Product\n",
- "Number of duplicate records for User is: 0\n",
- "No duplicate records in User\n"
- ]
- }
- ],
- "source": [
- "deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')\n",
- "deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')\n",
- "deduplicate('data/JData_Action_201604.csv', 'Apr. action', 'data/JData_Action_201604_dedup.csv')\n",
- "deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')\n",
- "deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')\n",
- "deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id sku_id time model_id cate brand\n",
- "type \n",
- "1 2176378 2176378 2176378 0 2176378 2176378\n",
- "2 636 636 636 0 636 636\n",
- "3 1464 1464 1464 0 1464 1464\n",
- "4 37 37 37 0 37 37\n",
- "5 1981 1981 1981 0 1981 1981\n",
- "6 575597 575597 575597 545054 575597 575597"
- ]
- },
- "execution_count": 6,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Inspect the duplicate rows\n",
- "df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n",
- "IsDuplicated = df_month2.duplicated()\n",
- "df_d = df_month2[IsDuplicated]\n",
- "df_d.groupby('type').count() # most duplicates come from views (1) or clicks (6)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
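- "### Check for users registered on or after April 15, 2016\n",
- "The data covers customer behavior before April 15, so it should not contain customers who registered after that date."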
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd user_reg_tm\n",
- "7457 207458 -1 2.0 1 2016-04-15\n",
- "7463 207464 26-35岁 2.0 2 2016-04-15\n",
- "7467 207468 36-45岁 2.0 3 2016-04-15\n",
- "7472 207473 -1 2.0 1 2016-04-15\n",
- "7482 207483 26-35岁 2.0 3 2016-04-15\n",
- "7492 207493 16-25岁 2.0 3 2016-04-15\n",
- "7493 207494 16-25岁 2.0 3 2016-04-15\n",
- "7503 207504 16-25岁 2.0 4 2016-04-15\n",
- "7510 207511 46-55岁 2.0 5 2016-04-15\n",
- "7512 207513 -1 2.0 1 2016-04-15\n",
- "7518 207519 26-35岁 2.0 2 2016-04-15\n",
- "7521 207522 26-35岁 0.0 3 2016-04-15\n",
- "7525 207526 -1 2.0 3 2016-04-15\n",
- "7533 207534 -1 2.0 1 2016-04-15\n",
- "7543 207544 26-35岁 2.0 3 2016-04-15\n",
- "7544 207545 -1 2.0 1 2016-04-15\n",
- "7551 207552 26-35岁 2.0 3 2016-04-15\n",
- "7553 207554 16-25岁 2.0 4 2016-04-15\n",
- "8545 208546 16-25岁 0.0 2 2016-04-29\n",
- "9394 209395 16-25岁 1.0 2 2016-05-11\n",
- "10362 210363 56岁以上 2.0 2 2016-05-24\n",
- "10367 210368 -1 2.0 1 2016-05-24\n",
- "11019 211020 36-45岁 2.0 3 2016-06-06\n",
- "12014 212015 36-45岁 2.0 2 2016-07-05\n",
- "13850 213851 26-35岁 2.0 3 2016-09-11\n",
- "14542 214543 -1 2.0 1 2016-10-05\n",
- "16746 216747 16-25岁 2.0 1 2016-11-25"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# check users whose user_reg_tm >= '2016-4-15'\n",
- "df_user = pd.read_csv('./data/JData_User.csv',encoding='gbk')\n",
- "df_user['user_reg_tm']=pd.to_datetime(df_user['user_reg_tm']) \n",
- "df_user.loc[df_user.user_reg_tm>= '2016-4-15']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
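- "The check shows that some users did register on or after April 15. Next, check whether the behavior data contains any records dated after April 15; if it did, those records would have to be removed."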
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Empty DataFrame\n",
- "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n",
- "Index: []"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_month = pd.read_csv('data/JData_Action_201604.csv')\n",
- "df_month['time'] = pd.to_datetime(df_month['time'])\n",
- "df_month.loc[df_month.time >= '2016-4-16']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
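- "The result is empty, so these customers have no interaction data after April 15 and this batch of customers does not need to be removed."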
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### user_id in the behavior data is a float; convert it to int"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "int64\n",
- "int64\n",
- "int64\n"
- ]
- }
- ],
- "source": [
- "df_month = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n",
- "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n",
- "print (df_month['user_id'].dtype)\n",
- "df_month.to_csv('data/JData_Action_201602.csv',index=None)\n",
- " \n",
- "df_month = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n",
- "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n",
- "print (df_month['user_id'].dtype)\n",
- "df_month.to_csv('data/JData_Action_201603.csv',index=None)\n",
- " \n",
- "df_month = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n",
- "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n",
- "print (df_month['user_id'].dtype)\n",
- "df_month.to_csv('data/JData_Action_201604.csv',index=None)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Handling age brackets\n",
- "Inspect the users' age distribution and encode it as a numeric feature"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- " 3.0 46570\n",
- " 4.0 30336\n",
- "-1.0 14412\n",
- " 2.0 8797\n",
- " 5.0 3325\n",
- " 6.0 1871\n",
- " 1.0 7\n",
- "Name: age, dtype: int64\n"
- ]
- }
- ],
- "source": [
- "age_mapping = { \n",
- " '15岁以下': 1, \n",
- " '16-25岁': 2, \n",
- " '26-35岁': 3,\n",
- " '36-45岁': 4,\n",
- " '46-55岁': 5,\n",
- " '56岁以上': 6,\n",
- " '-1' :-1\n",
- " } \n",
- "df_user['age'] = df_user['age'].map(age_mapping)\n",
- "print(df_user.age.value_counts())\n",
- "df_user.to_csv('data/JData_User.csv',index=None)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To support the cleaning steps above, we first build simple user behavior features and product (item) behavior features, stored in two tables: user_table and item_table."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Features in user_table:\n",
- "* user_id (user ID), age, sex,\n",
- "* user_lv_cd (user level), browse_num (number of views),\n",
- "* addcart_num (number of add-to-cart actions), delcart_num (number of remove-from-cart actions),\n",
- "* buy_num (number of purchases), favor_num (number of favorites),\n",
- "* click_num (number of clicks), buy_addcart_ratio (purchase-to-add-to-cart conversion rate),\n",
- "* buy_browse_ratio (purchase-to-view conversion rate),\n",
- "* buy_click_ratio (purchase-to-click conversion rate),\n",
- "* buy_favor_ratio (purchase-to-favorite conversion rate)\n",
- "\n",
- "### Features in item_table:\n",
- "* sku_id (product ID), attr1, attr2,\n",
- "* attr3, cate, brand, browse_num,\n",
- "* addcart_num, delcart_num,\n",
- "* buy_num, favor_num, click_num,\n",
- "* buy_addcart_ratio, buy_browse_ratio,\n",
- "* buy_click_ratio, buy_favor_ratio,\n",
- "* comment_num (comment count bucket),\n",
- "* has_bad_comment (whether there are bad comments),\n",
- "* bad_comment_rate (bad-comment rate)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Building user_table"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [],
- "source": [
- "# File names\n",
- "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n",
- "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n",
- "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n",
- "COMMENT_FILE = \"data/JData_Comment.csv\"\n",
- "PRODUCT_FILE = \"data/JData_Product.csv\"\n",
- "USER_FILE = \"data/JData_User.csv\"\n",
- "\n",
- "USER_TABLE_FILE = \"data/user_table.csv\"\n",
- "ITEM_TABLE_FILE = \"data/item_table.csv\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "from collections import Counter"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Helper: compute per-type action counts for one user's group of rows\n",
- "def add_type_count(group):\n",
- "    behavior_type = group.type.astype(int)\n",
- "    # count how many times each action type occurs\n",
- "    type_cnt = Counter(behavior_type)\n",
- "    # 1: view 2: add to cart 3: remove from cart\n",
- "    # 4: purchase 5: favorite 6: click\n",
- "    group['browse_num'] = type_cnt[1]\n",
- "    group['addcart_num'] = type_cnt[2]\n",
- "    group['delcart_num'] = type_cnt[3]\n",
- "    group['buy_num'] = type_cnt[4]\n",
- "    group['favor_num'] = type_cnt[5]\n",
- "    group['click_num'] = type_cnt[6]\n",
- "\n",
- "    return group[['user_id', 'browse_num', 'addcart_num',\n",
- "                  'delcart_num', 'buy_num', 'favor_num',\n",
- "                  'click_num']]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Because the behavior data is large, reading it all at once may raise a MemoryError, so we read it in chunks with pandas."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Compute statistics over the action data\n",
- "# Adjust chunk_size to suit your machine\n",
- "def get_from_action_data(fname, chunk_size=50000):\n",
- "    reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')\n",
- "    chunks = []\n",
- "    loop = True\n",
- "    while loop:\n",
- "        try:\n",
- "            # keep only the user_id and type columns\n",
- "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n",
- "            chunks.append(chunk)\n",
- "        except StopIteration:\n",
- "            loop = False\n",
- "            print(\"Iteration is stopped\")\n",
- "    # concatenate the chunks into a single pandas DataFrame\n",
- "    df_ac = pd.concat(chunks, ignore_index=True)\n",
- "    # group by user_id and compute the counts for each group; as_index=False asks pandas not to use the group key as the index\n",
- "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n",
- "    # every row of a group carries the same counts, so keep one row per user\n",
- "    df_ac = df_ac.drop_duplicates('user_id')\n",
- "\n",
- "    return df_ac"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Aggregate the statistics from the individual action files\n",
- "def merge_action_data():\n",
- "    df_ac = []\n",
- "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n",
- "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n",
- "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n",
- "\n",
- "    df_ac = pd.concat(df_ac, ignore_index=True)\n",
- "    # sum each user's counts across the monthly action tables\n",
- "    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n",
- "    # build the conversion-rate columns (a zero denominator yields NaN or inf)\n",
- "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n",
- "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n",
- "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n",
- "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n",
- "\n",
- "    # cap conversion rates greater than 1 at 1 (100%)\n",
- "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n",
- "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n",
- "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n",
- "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n",
- "\n",
- "    return df_ac"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Extract the required columns from the JData_User table\n",
- "def get_from_jdata_user():\n",
- "    df_usr = pd.read_csv(USER_FILE, header=0,encoding='gbk')\n",
- "    df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n",
- "    return df_usr"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd\n",
- "0 200001 6.0 2.0 5\n",
- "1 200002 -1.0 0.0 1\n",
- "2 200003 4.0 1.0 4\n",
- "3 200004 -1.0 2.0 1\n",
- "4 200005 2.0 0.0 4"
- ]
- },
- "execution_count": 20,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_base = get_from_jdata_user()\n",
- "user_base.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Iteration is stopped\n",
- "Iteration is stopped\n",
- "Iteration is stopped\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- " user_id browse_num addcart_num delcart_num buy_num favor_num \\\n",
- "0 200001 212 22 13 1 0 \n",
- "1 200002 238 1 0 0 0 \n",
- "2 200003 221 4 1 0 1 \n",
- "3 200004 52 0 0 0 0 \n",
- "4 200005 106 2 3 1 2 \n",
- "\n",
- " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n",
- "0 414 0.045455 0.004717 0.002415 \n",
- "1 484 0.000000 0.000000 0.000000 \n",
- "2 420 0.000000 0.000000 0.000000 \n",
- "3 61 NaN 0.000000 0.000000 \n",
- "4 161 0.500000 0.009434 0.006211 \n",
- "\n",
- " buy_favor_ratio \n",
- "0 1.0 \n",
- "1 NaN \n",
- "2 0.0 \n",
- "3 NaN \n",
- "4 0.5 "
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_behavior = merge_action_data()\n",
- "user_behavior.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Join into a single table, like a SQL left join\n",
- "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n",
- "# Save as user_table.csv\n",
- "user_behavior.to_csv(USER_TABLE_FILE, index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n",
- "0 200001 6.0 2.0 5 212.0 22.0 13.0 \n",
- "1 200002 -1.0 0.0 1 238.0 1.0 0.0 \n",
- "2 200003 4.0 1.0 4 221.0 4.0 1.0 \n",
- "3 200004 -1.0 2.0 1 52.0 0.0 0.0 \n",
- "4 200005 2.0 0.0 4 106.0 2.0 3.0 \n",
- "\n",
- " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n",
- "0 1.0 0.0 414.0 0.045455 0.004717 \n",
- "1 0.0 0.0 484.0 0.000000 0.000000 \n",
- "2 0.0 1.0 420.0 0.000000 0.000000 \n",
- "3 0.0 0.0 61.0 NaN 0.000000 \n",
- "4 1.0 2.0 161.0 0.500000 0.009434 \n",
- "\n",
- " buy_click_ratio buy_favor_ratio \n",
- "0 0.002415 1.0 \n",
- "1 0.000000 NaN \n",
- "2 0.000000 0.0 \n",
- "3 0.000000 NaN \n",
- "4 0.006211 0.5 "
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_table = pd.read_csv(USER_TABLE_FILE)\n",
- "user_table.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Building item_table"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "# File names\n",
- "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n",
- "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n",
- "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n",
- "COMMENT_FILE = \"data/JData_Comment.csv\"\n",
- "PRODUCT_FILE = \"data/JData_Product.csv\"\n",
- "USER_FILE = \"data/JData_User.csv\"\n",
- "\n",
- "USER_TABLE_FILE = \"data/user_table.csv\"\n",
- "ITEM_TABLE_FILE = \"data/item_table.csv\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "import numpy as np\n",
- "from collections import Counter"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Read the products from the Product file\n",
- "def get_from_jdata_product():\n",
- "    df_item = pd.read_csv(PRODUCT_FILE, header=0,encoding='gbk')\n",
- "    return df_item"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Helper: compute per-type action counts for one product's group of rows\n",
- "def add_type_count(group):\n",
- "    behavior_type = group.type.astype(int)\n",
- "    type_cnt = Counter(behavior_type)\n",
- "\n",
- "    group['browse_num'] = type_cnt[1]\n",
- "    group['addcart_num'] = type_cnt[2]\n",
- "    group['delcart_num'] = type_cnt[3]\n",
- "    group['buy_num'] = type_cnt[4]\n",
- "    group['favor_num'] = type_cnt[5]\n",
- "    group['click_num'] = type_cnt[6]\n",
- "\n",
- "    return group[['sku_id', 'browse_num', 'addcart_num',\n",
- "                  'delcart_num', 'buy_num', 'favor_num',\n",
- "                  'click_num']]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Compute statistics over the action data\n",
- "def get_from_action_data(fname, chunk_size=50000):\n",
- "    reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')\n",
- "    chunks = []\n",
- "    loop = True\n",
- "    while loop:\n",
- "        try:\n",
- "            chunk = reader.get_chunk(chunk_size)[[\"sku_id\", \"type\"]]\n",
- "            chunks.append(chunk)\n",
- "        except StopIteration:\n",
- "            loop = False\n",
- "            print(\"Iteration is stopped\")\n",
- "\n",
- "    df_ac = pd.concat(chunks, ignore_index=True)\n",
- "    df_ac = df_ac.groupby(['sku_id'], as_index=False).apply(add_type_count)\n",
- "    df_ac = df_ac.drop_duplicates('sku_id')\n",
- "\n",
- "    return df_ac"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Load product data from the comment file; if a product has comments on more than one date, keep the latest one\n",
- "def get_from_jdata_comment():\n",
- "    df_cmt = pd.read_csv(COMMENT_FILE, header=0)\n",
- "    df_cmt['dt'] = pd.to_datetime(df_cmt['dt'])\n",
- "    # find latest comment index\n",
- "    idx = df_cmt.groupby(['sku_id'])['dt'].transform('max') == df_cmt['dt']  # keep the rows at each product's latest date\n",
- "    df_cmt = df_cmt[idx]\n",
- "\n",
- "    return df_cmt[['sku_id', 'comment_num',\n",
- "                   'has_bad_comment', 'bad_comment_rate']]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [],
- "source": [
- "def merge_action_data():\n",
- "    df_ac = []\n",
- "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n",
- "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n",
- "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n",
- "\n",
- "    df_ac = pd.concat(df_ac, ignore_index=True)\n",
- "    df_ac = df_ac.groupby(['sku_id'], as_index=False).sum()\n",
- "\n",
- "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n",
- "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n",
- "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n",
- "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n",
- "\n",
- "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n",
- "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n",
- "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n",
- "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n",
- "\n",
- "    return df_ac"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " sku_id a1 a2 a3 cate brand\n",
- "0 10 3 1 1 8 489\n",
- "1 100002 3 2 2 8 489\n",
- "2 100003 1 -1 -1 8 30\n",
- "3 100006 1 2 1 8 545\n",
- "4 10001 -1 1 2 8 244"
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_base = get_from_jdata_product()\n",
- "item_base.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Iteration is stopped\n",
- "Iteration is stopped\n",
- "Iteration is stopped\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- " sku_id browse_num addcart_num delcart_num buy_num favor_num \\\n",
- "0 2 55 0 0 0 0 \n",
- "1 18 2 0 0 0 0 \n",
- "2 36 107 4 0 0 1 \n",
- "3 37 5 0 0 0 0 \n",
- "4 40 79 2 2 0 0 \n",
- "\n",
- " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n",
- "0 79 NaN 0.0 0.0 \n",
- "1 2 NaN 0.0 0.0 \n",
- "2 186 0.0 0.0 0.0 \n",
- "3 10 NaN 0.0 0.0 \n",
- "4 179 0.0 0.0 0.0 \n",
- "\n",
- " buy_favor_ratio \n",
- "0 NaN \n",
- "1 NaN \n",
- "2 0.0 \n",
- "3 NaN \n",
- "4 NaN "
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_behavior = merge_action_data()\n",
- "item_behavior.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " sku_id comment_num has_bad_comment bad_comment_rate\n",
- "512006 1000 3 1 0.0417\n",
- "512007 10000 2 0 0.0000\n",
- "512008 100011 4 1 0.0376\n",
- "512009 100018 3 0 0.0000\n",
- "512010 100020 3 0 0.0000"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_comment = get_from_jdata_comment()\n",
- "item_comment.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [],
- "source": [
- "# SQL: left join\n",
- "item_behavior = pd.merge(item_base, item_behavior, on=['sku_id'], how='left')\n",
- "item_behavior = pd.merge(item_behavior, item_comment, on=['sku_id'], how='left')\n",
- " \n",
- "item_behavior.to_csv(ITEM_TABLE_FILE, index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " sku_id a1 a2 a3 cate brand browse_num addcart_num delcart_num \\\n",
- "0 10 3 1 1 8 489 NaN NaN NaN \n",
- "1 100002 3 2 2 8 489 NaN NaN NaN \n",
- "2 100003 1 -1 -1 8 30 NaN NaN NaN \n",
- "3 100006 1 2 1 8 545 NaN NaN NaN \n",
- "4 10001 -1 1 2 8 244 NaN NaN NaN \n",
- "\n",
- " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n",
- "0 NaN NaN NaN NaN NaN \n",
- "1 NaN NaN NaN NaN NaN \n",
- "2 NaN NaN NaN NaN NaN \n",
- "3 NaN NaN NaN NaN NaN \n",
- "4 NaN NaN NaN NaN NaN \n",
- "\n",
- " buy_click_ratio buy_favor_ratio comment_num has_bad_comment \\\n",
- "0 NaN NaN NaN NaN \n",
- "1 NaN NaN NaN NaN \n",
- "2 NaN NaN NaN NaN \n",
- "3 NaN NaN NaN NaN \n",
- "4 NaN NaN NaN NaN \n",
- "\n",
- " bad_comment_rate \n",
- "0 NaN \n",
- "1 NaN \n",
- "2 NaN \n",
- "3 NaN \n",
- "4 NaN "
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_table = pd.read_csv(ITEM_TABLE_FILE)\n",
- "item_table.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Data cleaning\n",
- "\n",
- "User cleaning"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd browse_num \\\n",
- "count 105,321.000 105,318.000 105,318.000 105,321.000 105,180.000 \n",
- "mean 252,661.000 2.773 1.113 3.850 180.466 \n",
- "std 30,403.698 1.672 0.956 1.072 273.437 \n",
- "min 200,001.000 -1.000 0.000 1.000 0.000 \n",
- "25% 226,331.000 3.000 0.000 3.000 40.000 \n",
- "50% 252,661.000 3.000 2.000 4.000 94.000 \n",
- "75% 278,991.000 4.000 2.000 5.000 212.000 \n",
- "max 305,321.000 6.000 2.000 5.000 7,605.000 \n",
- "\n",
- " addcart_num delcart_num buy_num favor_num click_num \\\n",
- "count 105,180.000 105,180.000 105,180.000 105,180.000 105,180.000 \n",
- "mean 5.471 2.434 0.459 1.045 291.222 \n",
- "std 10.618 5.600 1.048 3.442 460.031 \n",
- "min 0.000 0.000 0.000 0.000 0.000 \n",
- "25% 0.000 0.000 0.000 0.000 59.000 \n",
- "50% 2.000 0.000 0.000 0.000 148.000 \n",
- "75% 6.000 3.000 1.000 0.000 342.000 \n",
- "max 369.000 231.000 50.000 99.000 15,302.000 \n",
- "\n",
- " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \n",
- "count 72,129.000 105,172.000 103,197.000 45,986.000 \n",
- "mean 0.147 0.005 0.009 0.552 \n",
- "std 0.270 0.022 0.074 0.473 \n",
- "min 0.000 0.000 0.000 0.000 \n",
- "25% 0.000 0.000 0.000 0.000 \n",
- "50% 0.000 0.000 0.000 1.000 \n",
- "75% 0.167 0.002 0.001 1.000 \n",
- "max 1.000 1.000 1.000 1.000 "
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_user = pd.read_csv(USER_TABLE_FILE, header=0)\n",
- "pd.options.display.float_format = '{:,.3f}'.format  # display format: keep three decimal places\n",
- "df_user.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
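- "From the statistics above: the count row shows 105,321 users by user_id, of whom 3 have no age or sex value.\n",
- "\n",
- "The browse, add-to-cart, remove-from-cart, purchase, and other behavior columns contain only 105,180 records, which means some users have no interaction records at all, so those users can be deleted."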
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Delete users with no age or sex value"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n",
- "34072 234073 nan nan 1 32.000 6.000 4.000 \n",
- "38905 238906 nan nan 1 171.000 3.000 2.000 \n",
- "67704 267705 nan nan 1 342.000 18.000 8.000 \n",
- "\n",
- " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n",
- "34072 1.000 0.000 41.000 0.167 0.031 \n",
- "38905 2.000 3.000 464.000 0.667 0.012 \n",
- "67704 0.000 0.000 743.000 0.000 0.000 \n",
- "\n",
- " buy_click_ratio buy_favor_ratio \n",
- "34072 0.024 1.000 \n",
- "38905 0.004 0.667 \n",
- "67704 0.000 nan "
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_user[df_user['age'].isnull()]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd browse_num \\\n",
- "count 105,318.000 105,318.000 105,318.000 105,318.000 105,177.000 \n",
- "mean 252,661.164 2.773 1.113 3.850 180.466 \n",
- "std 30,404.012 1.672 0.956 1.071 273.440 \n",
- "min 200,001.000 -1.000 0.000 1.000 0.000 \n",
- "25% 226,330.250 3.000 0.000 3.000 40.000 \n",
- "50% 252,661.500 3.000 2.000 4.000 94.000 \n",
- "75% 278,991.750 4.000 2.000 5.000 212.000 \n",
- "max 305,321.000 6.000 2.000 5.000 7,605.000 \n",
- "\n",
- " addcart_num delcart_num buy_num favor_num click_num \\\n",
- "count 105,177.000 105,177.000 105,177.000 105,177.000 105,177.000 \n",
- "mean 5.471 2.434 0.459 1.045 291.219 \n",
- "std 10.618 5.600 1.048 3.442 460.034 \n",
- "min 0.000 0.000 0.000 0.000 0.000 \n",
- "25% 0.000 0.000 0.000 0.000 59.000 \n",
- "50% 2.000 0.000 0.000 0.000 148.000 \n",
- "75% 6.000 3.000 1.000 0.000 342.000 \n",
- "max 369.000 231.000 50.000 99.000 15,302.000 \n",
- "\n",
- " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \n",
- "count 72,126.000 105,169.000 103,194.000 45,984.000 \n",
- "mean 0.147 0.005 0.009 0.552 \n",
- "std 0.270 0.022 0.074 0.473 \n",
- "min 0.000 0.000 0.000 0.000 \n",
- "25% 0.000 0.000 0.000 0.000 \n",
- "50% 0.000 0.000 0.000 1.000 \n",
- "75% 0.167 0.002 0.001 1.000 \n",
- "max 1.000 1.000 1.000 1.000 "
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "delete_list = df_user[df_user['age'].isnull()].index\n",
- "df_user.drop(delete_list,axis=0,inplace=True)\n",
- "df_user.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Delete users with no interaction records"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "105177\n"
- ]
- }
- ],
- "source": [
- "df_naction = df_user[(df_user['browse_num'].isnull()) & (df_user['addcart_num'].isnull()) & (df_user['delcart_num'].isnull()) & (df_user['buy_num'].isnull()) & (df_user['favor_num'].isnull()) & (df_user['click_num'].isnull())]\n",
- "df_user.drop(df_naction.index,axis=0,inplace=True)\n",
- "print(len(df_user))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Count and delete users with no purchase records"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "75694\n"
- ]
- }
- ],
- "source": [
- "df_bzero = df_user[df_user['buy_num']==0]\n",
- "# Print the total number of records with zero purchases\n",
- "print(len(df_bzero))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd browse_num addcart_num \\\n",
- "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n",
- "mean 250,746.445 2.914 1.025 4.272 302.488 10.525 \n",
- "std 29,979.676 1.490 0.959 0.808 391.535 14.301 \n",
- "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n",
- "25% 225,058.500 3.000 0.000 4.000 76.000 3.000 \n",
- "50% 249,144.000 3.000 1.000 4.000 178.000 6.000 \n",
- "75% 276,252.500 4.000 2.000 5.000 381.000 13.000 \n",
- "max 305,318.000 6.000 2.000 5.000 7,605.000 288.000 \n",
- "\n",
- " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n",
- "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n",
- "mean 4.673 1.637 1.677 486.653 0.360 \n",
- "std 7.568 1.412 4.584 658.671 0.320 \n",
- "min 0.000 1.000 0.000 0.000 0.004 \n",
- "25% 0.000 1.000 0.000 116.000 0.118 \n",
- "50% 2.000 1.000 0.000 282.000 0.250 \n",
- "75% 6.000 2.000 1.000 604.000 0.500 \n",
- "max 178.000 50.000 96.000 15,302.000 1.000 \n",
- "\n",
- " buy_browse_ratio buy_click_ratio buy_favor_ratio \n",
- "count 29,483.000 29,483.000 29,483.000 \n",
- "mean 0.018 0.030 0.862 \n",
- "std 0.038 0.136 0.287 \n",
- "min 0.000 0.000 0.010 \n",
- "25% 0.004 0.002 1.000 \n",
- "50% 0.008 0.005 1.000 \n",
- "75% 0.018 0.012 1.000 \n",
- "max 1.000 1.000 1.000 "
- ]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_user = df_user[df_user['buy_num']!=0]  # keep only users with purchase records\n",
- "df_user.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
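- "Delete crawlers and inactive users\n",
- "\n",
- "From the table above, the mean browse-to-purchase and click-to-purchase conversion ratios are 0.018 and 0.030, so users whose browse-to-purchase or click-to-purchase conversion ratio is below 0.0005 are treated here as inactive users"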
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "90\n"
- ]
- }
- ],
- "source": [
- "bindex = df_user[df_user['buy_browse_ratio']<0.0005].index\n",
- "print (len(bindex))\n",
- "df_user.drop(bindex,axis=0,inplace=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "323\n"
- ]
- }
- ],
- "source": [
- "cindex = df_user[df_user['buy_click_ratio']<0.0005].index\n",
- "print (len(cindex))\n",
- "df_user.drop(cindex,axis=0,inplace=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " user_id age sex user_lv_cd browse_num addcart_num \\\n",
- "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n",
- "mean 250,767.099 2.910 1.028 4.268 280.260 10.145 \n",
- "std 29,998.870 1.492 0.959 0.809 325.129 13.443 \n",
- "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n",
- "25% 225,036.000 3.000 0.000 4.000 75.000 3.000 \n",
- "50% 249,200.500 3.000 1.000 4.000 174.000 6.000 \n",
- "75% 276,284.000 4.000 2.000 5.000 366.000 13.000 \n",
- "max 305,318.000 6.000 2.000 5.000 5,007.000 288.000 \n",
- "\n",
- " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n",
- "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n",
- "mean 4.457 1.644 1.589 447.113 0.364 \n",
- "std 6.998 1.420 4.294 530.994 0.320 \n",
- "min 0.000 1.000 0.000 0.000 0.004 \n",
- "25% 0.000 1.000 0.000 114.000 0.125 \n",
- "50% 2.000 1.000 0.000 275.000 0.250 \n",
- "75% 6.000 2.000 1.000 585.000 0.500 \n",
- "max 158.000 50.000 69.000 8,156.000 1.000 \n",
- "\n",
- " buy_browse_ratio buy_click_ratio buy_favor_ratio \n",
- "count 29,070.000 29,070.000 29,070.000 \n",
- "mean 0.019 0.031 0.866 \n",
- "std 0.038 0.137 0.282 \n",
- "min 0.001 0.001 0.018 \n",
- "25% 0.004 0.002 1.000 \n",
- "50% 0.008 0.005 1.000 \n",
- "75% 0.018 0.012 1.000 \n",
- "max 1.000 1.000 1.000 "
- ]
- },
- "execution_count": 32,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_user.describe()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {},
- "outputs": [],
- "source": [
- "df_user.to_csv(\"data/JData_FUser.csv\", index=None)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
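
Review note on the deleted notebook's `add_type_count` and `merge_action_data` cells: the per-product behavior counts and the capped buy ratios could also be computed without a Python-level `groupby(...).apply(...)`. The sketch below is one possible vectorized version, not the notebook's own code. The helper names `count_types` and `add_buy_ratios` and the toy data are hypothetical; the type codes (1 = browse, 2 = add to cart, 3 = remove from cart, 4 = buy, 5 = favorite, 6 = click) and the column names are taken from the notebook.

```python
import pandas as pd

# Behavior codes and column names as used by add_type_count in the notebook.
TYPE_NAMES = {1: "browse_num", 2: "addcart_num", 3: "delcart_num",
              4: "buy_num", 5: "favor_num", 6: "click_num"}


def count_types(df_actions):
    """Hypothetical helper: one row per sku_id, one count column per behavior type."""
    counts = (
        df_actions.groupby(["sku_id", "type"]).size()
        .unstack(fill_value=0)                            # behavior types become columns
        .reindex(columns=list(TYPE_NAMES), fill_value=0)  # add types that never occurred
        .rename(columns=TYPE_NAMES)
    )
    counts.columns.name = None
    return counts.reset_index()


def add_buy_ratios(df_counts):
    """Hypothetical helper: buy/x ratios capped at 1, as in merge_action_data.

    0/0 stays NaN; a positive buy count over 0 gives inf, which the cap turns into 1.
    """
    out = df_counts.copy()
    for name in ("addcart", "browse", "click", "favor"):
        ratio = out["buy_num"] / out[f"{name}_num"]
        out[f"buy_{name}_ratio"] = ratio.clip(upper=1.0)
    return out


# Toy data (made up for illustration), not taken from the JData files.
toy = pd.DataFrame({"sku_id": [2, 2, 2, 36, 36, 36, 36],
                    "type":   [1, 6, 6, 1, 2, 4, 4]})
item_stats = add_buy_ratios(count_types(toy))
print(item_stats)
```

Because the counts are additive, the per-file results can still be concatenated and summed per `sku_id` before the ratios are computed, which is the order `merge_action_data` uses.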