Women's Clothing Data Analysis (E-commerce Data), Version 1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
data = pd.read_csv('Womens_Clothing.csv')
# Inspect the data structure
data
| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23481 | 23481 | 1104 | 34 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | 1 | 0 | General Petite | Dresses | Dresses |
23482 | 23482 | 862 | 48 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | 1 | 0 | General Petite | Tops | Knits |
23483 | 23483 | 1104 | 31 | Cute, but see through | This fit well, but the top was very see throug... | 3 | 0 | 1 | General Petite | Dresses | Dresses |
23484 | 23484 | 1084 | 28 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | 1 | 2 | General | Dresses | Dresses |
23485 | 23485 | 1104 | 52 | Please make more like this one! | This dress in a lovely platinum is feminine an... | 5 | 1 | 22 | General Petite | Dresses | Dresses |
23486 rows × 11 columns
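From the output above:
The dataset contains 23486 rows and 10 feature variables (the eleventh column, Unnamed: 0, mirrors the row index in the output above). Each row corresponds to one customer review and contains the following variables, listed here under their column names in the data:
**Clothing ID:** integer categorical variable identifying the specific item being reviewed.
**Age:** positive integer variable giving the reviewer's age.
**Title:** string variable holding the title of the review.
**Review Text:** string variable holding the body of the review.
**Rating:** positive ordinal integer variable for the product score given by the customer, from 1 (worst) to 5 (best).
**Recommended IND:** binary variable stating whether the customer recommends the product: 1 means recommended, 0 means not recommended.
**Positive Feedback Count:** positive integer recording the number of other customers who found this review positive.
**Division Name:** categorical name of the product's high-level division.
**Department Name:** categorical name of the product's department.
**Class Name:** categorical name of the product's class.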
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
Unnamed: 0 23486 non-null int64
Clothing ID 23486 non-null int64
Age 23486 non-null int64
Title 19676 non-null object
Review Text 22641 non-null object
Rating 23486 non-null int64
Recommended IND 23486 non-null int64
Positive Feedback Count 23486 non-null int64
Division Name 23472 non-null object
Department Name 23472 non-null object
Class Name 23472 non-null object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB
# Check for missing values
# data.isnull()
# Drop rows that contain missing values
df = data.dropna()
df
| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
5 | 5 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 | General | Dresses | Dresses |
6 | 6 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 5 | 1 | 1 | General Petite | Tops | Knits |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23481 | 23481 | 1104 | 34 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | 1 | 0 | General Petite | Dresses | Dresses |
23482 | 23482 | 862 | 48 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | 1 | 0 | General Petite | Tops | Knits |
23483 | 23483 | 1104 | 31 | Cute, but see through | This fit well, but the top was very see throug... | 3 | 0 | 1 | General Petite | Dresses | Dresses |
23484 | 23484 | 1084 | 28 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | 1 | 2 | General | Dresses | Dresses |
23485 | 23485 | 1104 | 52 | Please make more like this one! | This dress in a lovely platinum is feminine an... | 5 | 1 | 22 | General Petite | Dresses | Dresses |
19662 rows × 11 columns
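Calling dropna() with no arguments removes every row that has a missing value in any column, which is why the table shrinks from 23486 to 19662 rows. The sketch below shows how one might count the gaps per column and, if only some columns matter for an analysis, drop on those columns alone. The frame toy and its values are invented for illustration and are not rows from Womens_Clothing.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the review table; all values are made up.
toy = pd.DataFrame({
    'Title': [np.nan, 'My favorite buy!', np.nan, 'Flattering shirt'],
    'Review Text': ['Absolutely wonderful', 'I love this jumpsuit', np.nan, 'Very flattering'],
    'Rating': [4, 5, 3, 5],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop rows with a gap in ANY column (what the notebook does).
drop_any = toy.dropna()

# Drop rows only when the review body itself is missing.
drop_text_only = toy.dropna(subset=['Review Text'])

print(len(toy), len(drop_any), len(drop_text_only))
```

Whether the stricter or the looser variant is appropriate depends on which columns the later steps actually use; the notebook keeps the stricter one.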
Analysis
# 1. Visualize the age distribution of the reviewers
plt.hist(df['Age'], color=color[1], label='age')
plt.legend()
plt.xlabel('age')
plt.ylabel('count')
plt.title('age of commentator')
print('\n figure 01')
figure 01
Conclusion
Figure 01 shows that most reviewers are between 25 and 45 years old, so young and middle-aged adults make up the majority.
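A reading taken from a histogram can be backed by a number. The following is a minimal sketch, assuming a hypothetical list of ages rather than the real Age column, of how the share of reviewers in the 25 to 45 band could be computed.

```python
import pandas as pd

# Hypothetical ages, invented for illustration only.
ages = pd.Series([33, 34, 60, 50, 47, 49, 39, 28, 31, 52], name='Age')

# between() is inclusive on both ends by default.
in_band = ages.between(25, 45)

# The mean of a boolean Series is the fraction of True values.
share = in_band.mean()
print(f"{share:.0%} of these reviewers are aged 25-45")
```

On the notebook's data the same two lines would apply to df['Age'].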
# 2. Visualize the age distribution for each rating
plt.figure(figsize=(10, 8))
sns.boxplot(x='Rating', y='Age', data=df)
plt.title('age of rating')
print('\n figure 02')
figure 02
Conclusion
Figure 02 shows that the age distribution is roughly the same for every rating.
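The box plot comparison can also be expressed as a small table of per-rating statistics. This is a sketch on invented data (the frame toy is an assumption, not part of the dataset) that groups ages by rating and reports count, median and mean.

```python
import pandas as pd

# Hypothetical ratings and ages, invented for illustration only.
toy = pd.DataFrame({
    'Rating': [5, 5, 5, 3, 3, 3, 1, 1],
    'Age':    [34, 50, 47, 60, 31, 28, 39, 45],
})

# One row per rating, one column per statistic.
age_by_rating = toy.groupby('Rating')['Age'].agg(['count', 'median', 'mean'])
print(age_by_rating)
```

If the medians on the real data sit close together, that would support the conclusion drawn from figure 02.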
3. Which clothing is recommended in each department?
Check the unique values of Division Name, Department Name and Class Name
print('Division Name', df['Division Name'].unique())
print()
print('Department Name', df['Department Name'].unique())
print()
print('Class Name', df['Class Name'].unique())
Division Name ['General' 'General Petite' 'Initmates']

Department Name ['Dresses' 'Bottoms' 'Tops' 'Intimate' 'Jackets' 'Trend']

Class Name ['Dresses' 'Pants' 'Blouses' 'Knits' 'Intimates' 'Outerwear' 'Lounge'
 'Sweaters' 'Skirts' 'Fine gauge' 'Sleep' 'Jackets' 'Swim' 'Trend' 'Jeans'
 'Shorts' 'Legwear' 'Layering' 'Casual bottoms' 'Chemises']
Split the data by Recommended IND: 1 means the product is recommended, 0 means it is not
# recommend not_recommend
recommend = df[df['Recommended IND'] == 1]
not_recommend = df[df['Recommended IND'] == 0]
# recommend.head()
not_recommend.head()
| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
5 | 5 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 | General | Dresses | Dresses |
10 | 10 | 1077 | 53 | Dress looks like it's made of cheap material | Dress runs small esp where the zipper area run... | 3 | 0 | 14 | General | Dresses | Dresses |
22 | 22 | 1077 | 31 | Not what it looks like | First of all, this is not pullover styling. th... | 2 | 0 | 7 | General | Dresses | Dresses |
25 | 25 | 697 | 31 | Falls flat | Loved the material, but i didnt really look at... | 3 | 0 | 0 | Initmates | Intimate | Lounge |
# 4. Overlaid bar chart of recommended vs. not recommended reviews by department
plt.figure(figsize=(12,8))
plt.hist(recommend['Department Name'], color=color[2], alpha=0.5, label='recommend')
plt.hist(not_recommend['Department Name'], color=color[4], alpha=0.5, label='not_recommend')
plt.legend()
plt.xticks(rotation=45)
plt.title('Department recommend and not_recommend')
print('\n figure 03')
figure 03
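Conclusion
In figure 03 the recommend bars (green) cover a larger area than the not_recommend bars, so in most departments more reviewers recommend the product than not.
# Overlaid bar chart of recommended vs. not recommended reviews by product class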
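Comparing bar areas by eye is imprecise, so a recommendation rate per group may be easier to interpret. The sketch below is one possible way to compute it; the helper recommend_rate and the frame toy are hypothetical names introduced here, and the values are invented.

```python
import pandas as pd

# Hypothetical reviews, invented for illustration only.
toy = pd.DataFrame({
    'Department Name': ['Dresses', 'Dresses', 'Dresses', 'Tops', 'Tops', 'Bottoms'],
    'Recommended IND': [1, 0, 1, 1, 1, 0],
})

def recommend_rate(frame, by):
    """Return the review count and the share of recommending reviews per group."""
    # Because the indicator is 0/1, its mean is the recommendation rate.
    grouped = frame.groupby(by)['Recommended IND'].agg(reviews='count', rate='mean')
    return grouped.sort_values('rate', ascending=False)

rates = recommend_rate(toy, 'Department Name')
print(rates)
```

On the notebook's data the helper could be called as recommend_rate(df, 'Department Name'), or with 'Class Name' to compare product classes by both volume and rate.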
plt.figure(figsize=(12,8))
plt.hist(recommend['Class Name'], color=color[1], alpha=0.5, label='recommend')
plt.hist(not_recommend['Class Name'], color=color[5], alpha=0.5, label='not_recommend')
plt.legend()
plt.xticks(rotation=45)
plt.title('Class recommend and not_recommend')
print('\n figure 04')
figure 04
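Conclusion
Figure 04 shows that Knits, the class with the most reviews (taken here as the best-selling class), does not have the highest recommendation rate.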
# Which age groups write what kind of reviews about which clothes
# Work on an explicit copy first so that adding a column does not raise SettingWithCopyWarning
df = df.copy()
df['Review Length'] = df['Review Text'].astype(str).apply(len)
df
| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name | Review Length |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses | 500 |
3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants | 124 |
4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses | 192 |
5 | 5 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 | General | Dresses | Dresses | 488 |
6 | 6 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 5 | 1 | 1 | General Petite | Tops | Knits | 496 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23481 | 23481 | 1104 | 34 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | 1 | 0 | General Petite | Dresses | Dresses | 131 |
23482 | 23482 | 862 | 48 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | 1 | 0 | General Petite | Tops | Knits | 223 |
23483 | 23483 | 1104 | 31 | Cute, but see through | This fit well, but the top was very see throug... | 3 | 0 | 1 | General Petite | Dresses | Dresses | 208 |
23484 | 23484 | 1084 | 28 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | 1 | 2 | General | Dresses | Dresses | 427 |
23485 | 23485 | 1104 | 52 | Please make more like this one! | This dress in a lovely platinum is feminine an... | 5 | 1 | 22 | General Petite | Dresses | Dresses | 110 |
19662 rows × 12 columns
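# Plot the distribution of the single variable Review Length
# sns.distplot() used to be the most convenient way to do this (histogram plus a fitted kernel density estimate, KDE),
# but it is deprecated since seaborn 0.11; sns.histplot(..., kde=True, stat='density') is the closest replacement
fig = plt.figure(figsize=(12, 8))
ax = sns.histplot(df['Review Length'], kde=True, stat='density', color=color[3])
ax.set_title("Length of Reviews")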
print('\n figure 05')
figure 05
Conclusion
Figure 05 shows that most reviews are roughly 500 characters long.
# Visualize the distribution of review length for each age
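The statement about typical review length can be checked with quantiles instead of reading it off the density plot. A minimal sketch, assuming a few invented review strings in place of the real Review Text column:

```python
import pandas as pd

# Hypothetical review texts, invented for illustration only.
reviews = pd.Series([
    'Love this dress!',
    'Runs small.',
    'This shirt is very flattering and the fabric feels soft.',
    'Cute, but see through.',
], name='Review Text')

# .str.len() counts characters per review, like astype(str).apply(len).
lengths = reviews.str.len()

# Quartiles summarise where the bulk of the lengths lie.
summary = lengths.quantile([0.25, 0.5, 0.75])
print(lengths.tolist())
print(summary)
```

On the notebook's data, df['Review Length'].quantile([0.25, 0.5, 0.75]) would show whether the bulk of reviews really sits near 500 characters.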
plt.figure(figsize=(18,8))
sns.boxplot(x='Age', y='Review Length', data=df)
print('\n figure 06')
figure 06
# Rating vs. positive feedback count
plt.figure(figsize=(12,8))
sns.boxplot(x = 'Rating', y = 'Positive Feedback Count', data = df)
print('\n figure 07')
figure 07
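Conclusion
Figure 07 shows that reviews with a rating of 3 or higher receive larger positive feedback counts.
Word cloud visualization of the reviews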
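# 1. Data cleaning
import re
from wordcloud import WordCloud, STOPWORDS

def clean_data(text):
    # Replace punctuation, digits and other non-letter characters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(words)
    # return letters_only

# Add garment names to the stop words so they do not dominate every cloud
# (the original set was missing a comma between 'pants' and 'jean', which merged them into 'pantsjean')
stopwords = set(STOPWORDS) | {'skirt', 'blouse', 'dress', 'sweater', 'shirt', 'bottom',
                              'pant', 'pants', 'jean', 'jeans', 'jacket', 'top', 'dresse'}

def create_cloud(rating):
    # Join all cleaned reviews of one rating into a single string
    x = [i for i in rating]
    y = ' '.join(x)
    cloud = WordCloud(background_color='white', width=1600, height=800,
                      max_words=100, stopwords=stopwords).generate(y)
    plt.figure(figsize=(15, 7.5))
    plt.axis('off')
    plt.imshow(cloud)
    plt.show()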
# Word cloud for rating 5
rating5= df[df['Rating']==5]['Review Text'].apply(clean_data)
create_cloud(rating5)
# Word cloud for rating 4
rating4= df[df['Rating']==4]['Review Text'].apply(clean_data)
create_cloud(rating4)
# Word cloud for rating 3
rating3= df[df['Rating']==3]['Review Text'].apply(clean_data)
create_cloud(rating3)
# Word cloud for rating 2
rating2= df[df['Rating']==2]['Review Text'].apply(clean_data)
create_cloud(rating2)
# Word cloud for rating 1
rating1= df[df['Rating']==1]['Review Text'].apply(clean_data)
create_cloud(rating1)
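A word cloud shows which words stand out but not how often they occur. As a rough numeric companion, the sketch below counts word frequencies with the standard library only. The helper clean_text, the stop-word set stop and the two review strings are assumptions made for illustration; clean_text merely mirrors the regular expression used in the cleaning step and is not the notebook's own function.

```python
import re
from collections import Counter

def clean_text(text):
    """Hypothetical helper: keep letters only, lowercase, split into words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower().split()

# Hypothetical stop words; the notebook uses wordcloud's STOPWORDS plus garment names.
stop = {'this', 'the', 'is', 'and', 'it', 'i', 'a', 'dress', 'so'}

# Invented reviews for illustration only.
reviews = [
    "Love this dress! It is so soft and comfortable.",
    "The dress is soft, but the fit is too small.",
]

# Count every cleaned word that is not a stop word.
counts = Counter(
    word
    for review in reviews
    for word in clean_text(review)
    if word not in stop
)
print(counts.most_common(3))
```

Running the same count separately for each rating would give a table that can be compared across ratings more directly than five images.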