Common pandas Recipes for Working with Data Tables

# =============================================================================
# Pandas
# pandas is a fast, powerful, flexible and easy to use open source data analysis
# and manipulation tool, built on top of the Python programming language.
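# pandas has two main data structures: Series and DataFrame.
# A Series can be thought of as a one-dimensional array; the main difference
# from a plain array is that a Series carries an index.
# A DataFrame is a two-dimensional tabular structure, comparable to a sheet in Excel.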
# =============================================================================
import pandas as pd

# The DataFrame is one of pandas' most important data structures.
# It's basically a way to store tabular data where you can label the rows
# and the columns. One way to build a DataFrame is from a dictionary.
test = {'a': [1, 2, 3, 4, 5], 'b': [9, 8, 7, 6, 5]}
test_df1 = pd.DataFrame(test)
test_df2 = pd.DataFrame.from_dict(test, orient='columns')

# Note: if every value in the dictionary is a scalar, the constructor raises
# an error unless an index is supplied.
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}
# europe_df = pd.DataFrame(europe)  # ValueError: If using all scalar values, you must pass an index
europe_df = pd.DataFrame(europe, index=[0])

# =============================================================================
# candy_crush is a match-three game; this dataset records one week of player activity.
# Players can choose a level and can replay the same level.
# The data download link is at the end of the post.
# =============================================================================
candy_crush = pd.read_csv(r'C:\Users\zhou.c.15\Downloads\candy_crush.csv')
candy_crush.info()  # info() prints its summary itself and returns None, so no print() is needed
print(candy_crush.head())
# =============================================================================
# Subset and Slicing
# =============================================================================
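# First 50 rows
print(candy_crush[0:50])
# Inspect the column names
print(candy_crush.columns)
# Two ways to rename columns:
# In bulk, by listing every column name: candy_crush.columns = new_columns
# Selectively, by naming the columns to change: candy_crush.rename(columns={'player_id': 'playerid'}, inplace=True)
# inplace defaults to False; set it to True (or assign the returned DataFrame back),
# otherwise the column names of the original DataFrame stay unchanged.

# Selecting columns with double brackets [[...]] returns a DataFrame; single brackets return a Series.
type(candy_crush[['player_id', 'dt']])  # DataFrame
type(candy_crush[['player_id']])        # DataFrame
type(candy_crush['player_id'])          # Series

# Locating data
# To modify part of a table, do not assign through a slice (chained indexing). Use .loc/.iloc,
# combined with a filter, to select the rows and columns to change and assign to that selection directly.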
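A minimal sketch of that advice, using a small made-up table instead of the real dataset. The `demo` frame and its values are assumptions for illustration; only the column names follow the post.

```python
import pandas as pd

# Hypothetical stand-in for candy_crush (values are invented).
demo = pd.DataFrame({
    "player_id": [101, 101, 102, 103],
    "dt": ["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-03"],
    "num_attempts": [3, 5, 2, 4],
    "num_success": [1, 0, 1, 0],
})

# Boolean filter: rows recorded on 2014-01-01.
mask = demo["dt"] == "2014-01-01"

# A single .loc call selects the rows (mask) and the column (label) and assigns to them.
demo.loc[mask, "num_attempts"] = 0

# .iloc does the same by integer position: row 1, column 3 (num_success).
demo.iloc[1, 3] = 1

print(demo)
```

Because the row filter and the column label go into the same indexer, the assignment reaches the original frame rather than a temporary copy.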
# iloc is based on integer row/column position
candy_crush.iloc[0]       # first row, returned as a Series
candy_crush.iloc[[0]]     # first row, returned as a DataFrame
candy_crush.iloc[0:3, 0]  # first three rows of the first column, returned as a Series
candy_crush.iloc[:, [0]]  # all rows of the first column, returned as a DataFrame
candy_crush.iloc[:, 0:3]  # first three columns; a slice needs no extra brackets and returns a DataFrame

# If the DataFrame's index is changed to something else
candy_crush.set_index('player_id', inplace=True)  # the index is now the player_id column
candy_crush.iloc[0:3]  # iloc is unaffected

# loc is based on row/column labels (names)
# candy_crush.loc[0]  # KeyError when no player_id equals 0, because loc looks up index labels
type(candy_crush.loc[:, 'dt'])    # the dt column, returned as a Series
type(candy_crush.loc[:, ['dt']])  # the dt column, returned as a DataFrame
# Reset the index. drop controls whether the old index is discarded;
# with drop=False the old index is kept as a regular column.
candy_crush.reset_index(drop=False, inplace=True)
candy_crush.loc[0]  # the index is the default 0..n-1 again, so this returns the first row as a Series

# =============================================================================
# Filtering
# =============================================================================
ftl = candy_crush.dt == '2014-01-01'  # returns a boolean Series
candy_crush[ftl]  # rows for 2014-01-01

# Combine several filter conditions with | (or), & (and), ~ (not); wrap each condition in parentheses
multi_fil = (candy_crush.dt == '2014-01-01') & (candy_crush.num_success == 1)
candy_crush[multi_fil]   # rows from 2014-01-01 where the level was cleared
candy_crush[~multi_fil]  # every other row, i.e. rows that do not satisfy both conditions
candy_crush.player_id[multi_fil]  # players who cleared a level on 2014-01-01
candy_crush.player_id[multi_fil].unique()  # player ids repeat, so deduplicate; returns a numpy.ndarray
# Note: unique() is a Series method, so both
# candy_crush.player_id[multi_fil].unique() and candy_crush['player_id'][multi_fil].unique() work,
# but candy_crush[['player_id']][multi_fil].unique() raises AttributeError because a DataFrame has no unique().
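The Series/DataFrame difference in the note above can be sketched on a tiny made-up table. The `demo` frame and its integer ids are assumptions for illustration; `drop_duplicates()` is shown as one possible DataFrame-level alternative, not something the post itself uses here.

```python
import pandas as pd

# Hypothetical data: integer player ids, one row per play record.
demo = pd.DataFrame({"player_id": [101, 102, 101, 103], "num_success": [1, 1, 1, 0]})
cleared = demo["num_success"] == 1

# Series: unique() exists and returns the distinct values in order of first appearance.
ids = demo["player_id"][cleared].unique()
print(type(ids).__name__, list(ids))

# DataFrame: there is no unique(), so the call raises AttributeError.
try:
    demo[["player_id"]][cleared].unique()
except AttributeError as exc:
    print("AttributeError:", exc)

# drop_duplicates() deduplicates while keeping the result a DataFrame.
ids_df = demo[["player_id"]][cleared].drop_duplicates()
print(ids_df)
```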
player_list = list(candy_crush.player_id[multi_fil].unique())  # convert to a list

# =============================================================================
# Aggregation and agg functions:
# With a single aggregation, use .groupby()[...].aggfunc(), where aggfunc is sum, mean, count, ...
# With several aggregations, pass them as a list: .groupby()[...].agg([func1, func2, ...])
# =============================================================================
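The two call shapes from the header above, sketched on a small invented table. The `demo` data is an assumption; the named-aggregation form at the end is an additional option (not used in the post) that yields flat column names instead of MultiIndex columns.

```python
import pandas as pd

# Hypothetical play records (values are invented; column names follow the post).
demo = pd.DataFrame({
    "level": [1, 1, 2, 2, 2],
    "player_id": [101, 102, 101, 101, 103],
    "num_attempts": [3, 1, 4, 2, 6],
    "num_success": [1, 1, 0, 1, 2],
})

# One aggregation: call it directly on the grouped column.
players_per_level = demo.groupby("level")["player_id"].nunique()

# Several aggregations: pass a list to .agg(); the result has MultiIndex columns.
stats = demo.groupby("level")[["num_attempts", "num_success"]].agg(["sum", "mean"])

# Named aggregation: each keyword becomes a flat output column.
flat = demo.groupby("level").agg(
    attempts_sum=("num_attempts", "sum"),
    success_sum=("num_success", "sum"),
    players=("player_id", "nunique"),
)
flat["success_rate"] = flat["success_sum"] / flat["attempts_sum"]

print(stats)
print(flat)
```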
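# group by & pivot_table
# groupby is usually paired with an aggregation function
# DAU (daily active users)
DAU = candy_crush.groupby(['dt'])[['player_id']].agg('nunique')
# Level difficulty analysis
# Number of distinct players on each level
byLevel_player = candy_crush.groupby(['level'])['player_id'].nunique()
byLevel = candy_crush.groupby(['level'])[['num_attempts', 'num_success']].agg(['sum', 'mean'])  # double brackets are needed to select several columns
# Add new columns with .loc (.iloc cannot create a column that does not exist yet)
byLevel.loc[:, 'User_avg_attemp#'] = byLevel['num_attempts']['sum'] / byLevel_player
byLevel.loc[:, 'User_success_rate'] = byLevel['num_success']['sum'] / byLevel['num_attempts']['sum']
# 'User_avg_attemp#' differs from the mean of 'num_attempts': the former divides the total attempts
# by the number of distinct players on the level, while the latter averages over records
# (one per player per day, assuming the dataset has one row per player, day, and level).
# 'User_success_rate' is highly correlated with 'User_avg_attemp#'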
import matplotlib.pyplot as plt
byLevel[['User_avg_attemp#', 'User_success_rate']].plot(x='User_avg_attemp#', y='User_success_rate', kind="scatter")
plt.show()

# A digression: a few ways to add a column
# =============================================================================
# 1. Name the new column and give it a value
# from datetime import datetime
# candy_crush['create_time'] = datetime.strftime(datetime.now(), "%Y-%m-%d %H:%M:%S")
# candy_crush.drop(['create_time'], axis=1, inplace=True)  # axis=1 drops columns (axis=0 drops rows); inplace=True modifies the original data
# 2. insert(): the first argument is the position of the new column, the second its name, the third its values
# candy_crush.insert(0, 'ID', range(candy_crush.shape[0]))
# 3. Direct assignment
# candy_crush['new_columns'] = value
# 4. reindex with fill_value; not a common approach, because every column name (including the new one)
#    must be listed. fill_value only fills the newly introduced columns; NaN already present in
#    existing columns is left as is.
# candy_crush.reindex(columns = [], fill_value = )
# 5. concat, for joining tables side by side; see "Merge/concat/join tables"
# 6. .loc assignment, as used for byLevel above
# =============================================================================

# =============================================================================
# pivot_table and melt functions
# =============================================================================
# A pivot table needs an index
# This pivot is purely for demonstration and serves no analytical purpose
pivot_candy = candy_crush.pivot_table(index=['dt'],
                                      columns=['level'],
                                      values=['num_attempts', 'num_success'],
                                      aggfunc='sum',
                                      fill_value=0)
# After pivoting, the columns form a MultiIndex; droplevel can remove one of its levels
# pivot_candy = pivot_candy.droplevel(None, axis = 1)  # replace None with the level to drop
# pd.melt(pd.DataFrame, ...)
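# or use pd.DataFrame.melt(id_vars, value_vars, var_name, value_name)
# --id_vars: columns to keep as identifiers (they are not unpivoted).
# --value_vars: columns to unpivot; can be omitted when all remaining columns should be unpivoted.
# --var_name and value_name: custom names for the resulting variable and value columns.
# --col_level: if the columns are a MultiIndex, use this level to melt.
#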
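A small sketch of those parameters on a made-up wide table. The `wide` frame and the names `metric` and `count` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per day, one column per metric.
wide = pd.DataFrame({
    "dt": ["2014-01-01", "2014-01-02"],
    "num_attempts": [10, 7],
    "num_success": [4, 3],
})

# Unpivot the two metric columns into (metric, count) pairs, keeping dt as the identifier.
long = wide.melt(
    id_vars="dt",
    value_vars=["num_attempts", "num_success"],
    var_name="metric",
    value_name="count",
)
print(long)
```

Each `value_vars` column contributes one block of rows, so the result has `len(wide) * len(value_vars)` rows.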
unpivot_candy = pivot_candy.melt(var_name=['Statue', 'level'], value_name='number of times')

# =============================================================================
# Pandas: Data manipulation
# =============================================================================
left = pd.DataFrame({
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})
right = pd.DataFrame({
    "key3": ["K0", "K1", "K1", "K2"],
    "key4": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

# Merge/concat/join tables
# 1. merge, similar to JOIN in SQL
# pd.merge(table1, table2) and table1.merge(table2) both work
# When the key has the same name in both tables, use on=key_name; when the names differ, use left_on
# and right_on. If none of these is given, the columns the two tables have in common are used
# as the merge key. suffixes adds a suffix to same-named columns other than the merge keys.
table = left.merge(right, how='left', left_on=["key1", "key2"], right_on=["key3", "key4"],
                   suffixes=("_left", "_right"))
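To see what `suffixes` and a left join do to unmatched rows, here is a sketch on two small invented tables. The frames `orders_l` and `orders_r` and the shared column `v` are assumptions; `indicator=True` is an extra `merge` option, not used in the post, that records where each row came from.

```python
import pandas as pd

# Hypothetical tables with differently named keys and one overlapping non-key column (v).
orders_l = pd.DataFrame({"key1": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"], "v": [1, 2, 3]})
orders_r = pd.DataFrame({"key3": ["K0", "K1", "K3"], "C": ["C0", "C1", "C3"], "v": [10, 20, 30]})

merged = orders_l.merge(
    orders_r,
    how="left",                      # keep every row of the left table
    left_on="key1",
    right_on="key3",
    suffixes=("_left", "_right"),    # applied to the overlapping column v
    indicator=True,                  # adds a _merge column: both / left_only / right_only
)
print(merged)
```

The row with `key1 == "K2"` has no partner, so its right-hand columns are NaN and `_merge` reads `left_only`.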
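# 2. pd.concat([table1, table2, ...], axis=0/1, keys=None, join="outer", ...)
# concat joins several tables vertically or horizontally; axis defaults to 0, i.e. vertical stacking.
# Columns the tables do not share are filled with NaN by default.
# join defaults to "outer"; the options are "outer" and "inner"
pd.concat([left, right])
pd.concat([left, right], axis=1)  # concat is mostly used for vertical stacking; horizontal concat aligns on the index and cannot take a join key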
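The effect of `join` and `axis` is easier to see when the tables share only some columns. A sketch with two invented frames (`a` and `b` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical tables that share only the column y.
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"y": [5, 6], "z": [7, 8]})

# Vertical, outer (default): union of columns, gaps filled with NaN.
outer = pd.concat([a, b], ignore_index=True)

# Vertical, inner: only the columns present in every table.
inner = pd.concat([a, b], join="inner", ignore_index=True)

# Horizontal: rows are aligned on the index; the shared column name y appears twice.
side = pd.concat([a, b], axis=1)

print(outer, inner, side, sep="\n\n")
```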
# 3. pd.DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
# The parameters mean much the same as in merge; join defaults to a left join.
# It joins two tables side by side on their row index when they share no column names;
# if column names overlap, set lsuffix and rsuffix.
# It can also join on columns: df1.join(df2.set_index(key of df2), on='key of df1'),
# but the key columns of df2 (now its index) do not appear in the result.
left.join(right.set_index(["key3", "key4"]), on=["key1", "key2"])

# Count non-NA cells for each column or row.
# The values `None`, `NaN`, `NaT`, and optionally `numpy.inf` (depending
# on `pandas.options.mode.use_inf_as_na`) are considered NA.
table.count()
# Count NA cells in each column
# pd.DataFrame.isna()
# pd.DataFrame.isnull()  (an alias of isna)
table.isna().sum()

# Finding Missing Data
# NaN | None | NaT | Null
# NaN: Not a Number, the floating-point missing-value marker used by NumPy and pandas; np.nan is a plain Python float.
# None: Python's own null object; unlike an empty list or an empty string, it has its own type (NoneType).
# NaT: Not a Time; it can be stored in datetime arrays to mark an unknown or missing datetime value.
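A short sketch of how pandas treats these three markers. The sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# In a float Series, None is converted to NaN on construction.
s = pd.Series([1.0, np.nan, None])
print(s.isna().tolist())

# NaN never compares equal to itself, so test with isna(), not with ==.
nan_equals_itself = np.nan == np.nan

# pd.isna() recognises all three markers.
checks = [pd.isna(None), pd.isna(np.nan), pd.isna(pd.NaT)]

# A missing entry in datetime data is stored as NaT.
dates = pd.Series(pd.to_datetime(["2014-01-01", None]))
print(dates.isna().tolist())
```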
import numpy as np
type(np.nan)  # float (the alias np.NaN was removed in NumPy 2.0)
type(None)    # NoneType

# =============================================================================
# Other usage
# =============================================================================
# Check the data types
candy_crush.dtypes
# or use info()
candy_crush.info()

# Delete rows/columns
# candy_crush.drop([index/columns list])

# Sort rows
# To change the order of the rows, pass a column name to .sort_values()
# candy_crush.sort_values('', inplace = True)

# Drop duplicates
# pd.DataFrame.drop_duplicates()

# =============================================================================
# Write DataFrame into CSV file
# =============================================================================
# candy_crush.to_csv(dir)
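The commented-out stubs above (drop, sort, drop duplicates, write to CSV) can be strung together as follows. This is only a sketch on an invented table; the `demo` data, the file name `demo.csv`, and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table with one exact duplicate row (rows 1 and 2).
demo = pd.DataFrame({
    "player_id": [102, 101, 101, 103],
    "level": [2, 1, 1, 3],
    "num_attempts": [4, 3, 3, 5],
})

deduped = demo.drop_duplicates()                         # remove exact duplicate rows
ordered = deduped.sort_values("level", ascending=False)  # sort rows by a column
trimmed = ordered.drop(columns=["num_attempts"])         # delete a column

# Write to CSV without the index, then read it back to check the round trip.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    trimmed.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```

Each of these calls returns a new DataFrame by default, so the original `demo` is left untouched.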

Data download: candycrush game data (statistical analysis document resource, CSDN download)
