Contents
- Getting started with pandas
- Matrix operations
- Reindexing
- Statistics
- Merging data
- Grouped statistics
- Reshaping
- Pivot tables
- Time series
- Categorical data
- Plotting
- Reading and writing data
- Movie data analysis
- Preparation
- Data description
- Analyzing movie ratings with Pandas
- Merging the tables
- Pandas core data structures
- Series
- Series characteristics
- A Series behaves like an ndarray
- A Series behaves like a dict
- Label alignment
- The name attribute
- DataFrame
- Creating from structured data
- Creating from a list of dicts
- Creating from a dict of tuples
- Creating from a Series
- Column selection, addition, and deletion
- Using assign() to insert new columns
- Indexing and selection
- Data alignment
- Using numpy functions
- Panel
- Basic operations
- Reindexing
- Dropping data
- Broadcasting
- Function application
- Sorting and ranking
- Uniqueness and membership
Getting started with pandas
First, all of the examples below require the following imports:
# render plots inline
%matplotlib inline
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
A one-dimensional array — the Series:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a two-dimensional array:
dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Output:

|   | A | B | C | D |
|---|---|---|---|---|
| 2016-03-01 | -0.859805 | -0.069692 | -0.905092 | -0.553213 |
| 2016-03-02 | -0.353785 | 0.031793 | -0.785213 | -0.212337 |
| 2016-03-03 | 1.719976 | 0.925145 | 0.241639 | -0.490166 |
| 2016-03-04 | -1.207854 | -0.001647 | -0.468976 | -0.781144 |
| 2016-03-05 | 0.452034 | 1.371208 | 1.152729 | 1.470498 |
| 2016-03-06 | 1.378227 | 0.246941 | -1.186630 | -0.411647 |
df.values
array([[-8.59804740e-01, -6.96922374e-02, -9.05091676e-01, -5.53212518e-01],
       [-3.53785450e-01,  3.17933613e-02, -7.85212585e-01, -2.12337229e-01],
       [ 1.71997643e+00,  9.25144720e-01,  2.41639347e-01, -4.90166361e-01],
       [-1.20785422e+00, -1.64720630e-03, -4.68976209e-01, -7.81144372e-01],
       [ 4.52033577e-01,  1.37120779e+00,  1.15272905e+00,  1.47049771e+00],
       [ 1.37822701e+00,  2.46941166e-01, -1.18662963e+00, -4.11647030e-01]])
A DataFrame can also be built from a dictionary:
# create from a dict: each key becomes a column, each value fills that column
df1 = pd.DataFrame({'A': 1, 'B': pd.Timestamp('20160301'), 'C': range(4), 'D': np.arange(5, 9), 'E': 'text', 'F': ['AA', 'BB', 'CC', 'DD']})
df1
|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1 | 2016-03-01 | 0 | 5 | text | AA |
| 1 | 1 | 2016-03-01 | 1 | 6 | text | BB |
| 2 | 1 | 2016-03-01 | 2 | 7 | text | CC |
| 3 | 1 | 2016-03-01 | 3 | 8 | text | DD |
Each column can be accessed as an attribute of the DataFrame, e.g.
df.A or df.B
type(df.A)
pandas.core.series.Series
As you can see, every column (and likewise every row) is a Series.
df.dtypes    # data type of each column
df.shape     # number of rows and columns
df.head()    # first five rows; pass a number to change how many
df.tail()    # last five rows
df.index     # row index
df.columns   # column index
df.values    # the underlying values, returned as an ndarray
df.describe()  # per-column summary statistics: count, mean, std, min, 25%/50%/75% quantiles, max
df.T         # transpose; transpose() also works, as in numpy
df.sort_index()  # sort by the row index; axis=1 sorts the column index, ascending controls the order
df.sort_values(by='A')  # sort rows by the values in column A
df['A']      # select a column (df.A is equivalent)
df[2:4]      # select rows by position; labels and boolean arrays also work. When the row index itself
             # is numeric, numbers are treated as positions; label slices include the end point
df.loc[]     # label-based selection
df.iloc[]    # position-based selection
# example
df.loc[:, ['B', 'C']]  # all rows, columns B and C; a single value can be selected the same way
df.at[row_label, col_label]  # one value by row/column labels
df.iat[row_pos, col_pos]     # one value by positions
df[df.A > 0]  # boolean indexing: rows where column A is positive
df[df > 0]    # element-wise filter: entries failing the condition become NaN
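The selection methods above can be sketched on a small frame (the dates and values here are illustrative, not the random frame shown earlier):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20160301', periods=4)
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=dates, columns=list('ABCD'))

col_a = df['A']                            # label-based column selection
rows = df.loc['2016-03-02':'2016-03-03']   # label slicing includes the end point
cell = df.iloc[0, 1]                       # position-based: row 0, column 1
positive = df[df.A > 0]                    # boolean indexing on a column
```

Note how the label slice returns two rows, whereas a positional slice `df[1:3]` would exclude the stop position.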
Matrix operations
df['tag'] = ['a'] * 2 + ['b'] * 2 + ['c'] * 2
|   | A | B | C | D | tag |
|---|---|---|---|---|---|
| 2016-03-01 | 0.033228 | 2.307123 | -0.585367 | -1.671832 | a |
| 2016-03-02 | -1.967299 | 0.727670 | -0.190863 | -0.163514 | a |
| 2016-03-03 | 0.065359 | 0.696804 | -0.550040 | 0.717347 | b |
| 2016-03-04 | 0.234850 | 0.289520 | -1.087173 | 1.534277 | b |
| 2016-03-05 | -1.459620 | 1.040987 | 0.220130 | -0.068131 | c |
| 2016-03-06 | 0.865402 | 2.650889 | -0.015460 | -0.111889 | c |
df[df.tag.isin(['a', 'c'])]
|   | A | B | C | D | tag |
|---|---|---|---|---|---|
| 2016-03-01 | 0.033228 | 2.307123 | -0.585367 | -1.671832 | a |
| 2016-03-02 | -1.967299 | 0.727670 | -0.190863 | -0.163514 | a |
| 2016-03-05 | -1.459620 | 1.040987 | 0.220130 | -0.068131 | c |
| 2016-03-06 | 0.865402 | 2.650889 | -0.015460 | -0.111889 | c |
### Modifying elements
s = pd.Series(np.arange(6), index=pd.date_range('20160301', periods=6))
df['E'] = s  # replace a whole column
df.at[pd.Timestamp('20160301'), 'A'] = 0.4  # modify a single element
df.B = 200   # set every value in a column to one value
df.iloc[:, 2:5] = 1000  # modify a sub-block
To shut down the web-based Jupyter Notebook, press Ctrl+C twice in the console.
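A minimal sketch of these in-place modifications (the 3×3 zero frame is illustrative):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20160301', periods=3)
df = pd.DataFrame(np.zeros((3, 3)), index=dates, columns=list('ABC'))

df.at[pd.Timestamp('2016-03-01'), 'A'] = 0.4  # single element by label
df.B = 200                                    # broadcast one value across a column
df.iloc[:, 2:3] = 1000                        # a block of cells by position
```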
Reindexing
dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(data=np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
|   | A | B | C | D |
|---|---|---|---|---|
| 2016-03-01 | -0.985666 | 0.240058 | 0.716721 | 0.352009 |
| 2016-03-02 | -1.563644 | 0.091766 | 1.081764 | 0.951541 |
| 2016-03-03 | 0.279760 | -0.316136 | 1.198073 | -0.562947 |
| 2016-03-04 | 1.174777 | -0.225305 | -0.280256 | -0.074768 |
| 2016-03-05 | 2.173366 | 0.907038 | -1.104678 | -0.921779 |
| 2016-03-06 | 0.200422 | 0.442619 | 1.970330 | -0.609867 |
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2016-03-01 | -0.985666 | 0.240058 | 0.716721 | 0.352009 | NaN |
| 2016-03-02 | -1.563644 | 0.091766 | 1.081764 | 0.951541 | NaN |
| 2016-03-03 | 0.279760 | -0.316136 | 1.198073 | -0.562947 | NaN |
| 2016-03-04 | 1.174777 | -0.225305 | -0.280256 | -0.074768 | NaN |
### Handling missing data
df1.loc[dates[1:3], 'E'] = 1
df1
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 2016-03-01 | -0.985666 | 0.240058 | 0.716721 | 0.352009 | NaN |
| 2016-03-02 | -1.563644 | 0.091766 | 1.081764 | 0.951541 | 1.0 |
| 2016-03-03 | 0.279760 | -0.316136 | 1.198073 | -0.562947 | 1.0 |
| 2016-03-04 | 1.174777 | -0.225305 | -0.280256 | -0.074768 | NaN |
Column E now has two missing values.
df1.dropna()  # drop rows that contain any NaN
df1.fillna(value=5)  # replace every NaN with 5
pd.isnull(df1)  # element-wise missing-value test
pd.isnull(df1).any()  # per column: does it contain any NaN?
pd.isnull(df1).any().any()  # whole table: is anything missing?
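These three calls can be sketched on a tiny frame with known gaps (the values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

dropped = df.dropna()                 # keep only rows with no NaN at all
filled = df.fillna(value=5)           # replace every NaN with 5
has_nan = pd.isnull(df).any().any()   # True if any cell is missing
```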
Statistics
Missing values are excluded from statistical calculations.
df1.mean()  # mean of each column
df.sum(axis='columns')  # row sums
df.cumsum()  # cumulative sums
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)  # shift moves the data down two positions; the vacated slots become NaN
df.sub(s, axis='index')  # subtract s from every column, aligned on the index; rows where s is NaN become NaN, so missing values are usually handled before subtracting
df.apply(np.cumsum)  # same result as df.cumsum(); note that cumsum here comes from numpy, not pandas. apply passes each column to the function; applymap applies a function to every element
df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
s = pd.Series(np.random.randint(0, 7, size=10))  # the upper bound 7 is excluded, unlike Python's built-in random.randint
s.value_counts()  # count how often each value occurs
s.mode()  # the most frequent value(s)
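With fixed data instead of random integers, `value_counts` and `mode` behave predictably:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 2, 3])
counts = s.value_counts()  # frequency of each value, most frequent first
modes = s.mode()           # the most frequent value(s), as a Series
```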
Merging data
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.098103 | -0.843356 | -0.379135 | 0.419353 |
| 1 | -0.177702 | -0.225926 | -0.363542 | -0.153022 |
| 2 | 1.938231 | 0.154881 | 0.291382 | 0.152774 |
| 3 | -0.460645 | -0.268697 | -1.509469 | 0.698776 |
| 4 | -0.397048 | -0.958223 | 0.212833 | -0.435485 |
| 5 | 0.525406 | -0.177595 | 0.453216 | -0.093792 |
| 6 | 0.531912 | -0.832667 | 0.200721 | 0.943878 |
| 7 | -0.740845 | 0.098634 | 0.274020 | 1.671997 |
| 8 | 2.182379 | 1.729010 | 1.306269 | 0.580677 |
| 9 | -0.031538 | 0.159714 | 0.736667 | -0.122326 |
df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]])
(df == df1).all().all()  # check that the two frames are equal

# SQL-style join
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
# SELECT * FROM left INNER JOIN right ON left.key = right.key;
pd.merge(left, right, on='key')

s = pd.Series(np.random.randint(1, 5, size=4), index=list('ABCD'))
df.append(s, ignore_index=True)
If s had a fifth label, appending it would add a new column whose values are NaN in every row except the newly appended one.
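`DataFrame.append` was removed in pandas 2.0; `pd.concat` is the modern replacement. A sketch of the NaN-filling behavior described above, using a Series with one extra label (`'E'` is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((2, 2)), columns=list('AB'))
s = pd.Series({'A': 1, 'B': 2, 'E': 9})  # one label ('E') not present in df

# appending the Series as a row: the new 'E' column is NaN everywhere else
out = pd.concat([df, s.to_frame().T], ignore_index=True)
```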
Grouped statistics
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | -0.580320 | -1.460149 |
| 1 | bar | one | 1.471201 | -1.079598 |
| 2 | foo | two | 0.094836 | 1.513204 |
| 3 | bar | three | -1.498810 | 0.754968 |
| 4 | foo | two | 0.180709 | 0.415266 |
| 5 | bar | two | 0.358515 | -0.341988 |
| 6 | foo | one | -0.121082 | -0.408148 |
| 7 | foo | three | 0.404648 | -0.320882 |
df.groupby('A').sum()
| A | C | D |
|---|---|---|
| bar | 0.330906 | -0.666618 |
| foo | -0.021208 | -0.260709 |
df.groupby(['A', 'B']).sum()
| A | B | C | D |
|---|---|---|---|
| bar | one | 1.471201 | -1.079598 |
| bar | three | -1.498810 | 0.754968 |
| bar | two | 0.358515 | -0.341988 |
| foo | one | -0.701402 | -1.868297 |
| foo | three | 0.404648 | -0.320882 |
| foo | two | 0.275545 | 1.928470 |
The result has a two-level (hierarchical) index.
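Rows of such a result are addressed by tuples on the MultiIndex. A sketch with fixed values (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['bar', 'bar', 'foo', 'foo'],
                   'B': ['one', 'two', 'one', 'two'],
                   'C': [1, 2, 3, 4]})
g = df.groupby(['A', 'B']).sum()

# each row of g is identified by an (A, B) tuple
val = g.loc[('foo', 'two'), 'C']
```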
Reshaping
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
| first | second | A | B |
|---|---|---|---|
| bar | one | 0.072026 | 0.422077 |
| bar | two | -1.099181 | -0.354796 |
| baz | one | 1.285500 | -1.185525 |
| baz | two | 0.645316 | -0.660115 |
| foo | one | 0.696443 | -1.664527 |
| foo | two | 0.718399 | -0.154125 |
| qux | one | -0.740052 | 0.713089 |
| qux | two | -0.672748 | -1.346843 |
stacked = df.stack()
first  second
bar    one     A    0.072026
               B    0.422077
       two     A   -1.099181
               B   -0.354796
baz    one     A    1.285500
               B   -1.185525
       two     A    0.645316
               B   -0.660115
foo    one     A    0.696443
               B   -1.664527
       two     A    0.718399
               B   -0.154125
qux    one     A   -0.740052
               B    0.713089
       two     A   -0.672748
               B   -1.346843
dtype: float64
stacked.unstack()  # back to the original shape
stacked.unstack().unstack()
| first | A, one | A, two | B, one | B, two |
|---|---|---|---|---|
| bar | 0.072026 | -1.099181 | 0.422077 | -0.354796 |
| baz | 1.285500 | 0.645316 | -1.185525 | -0.660115 |
| foo | 0.696443 | 0.718399 | -1.664527 | -0.154125 |
| qux | -0.740052 | -0.672748 | 0.713089 | -1.346843 |
# the index level to unstack into columns can be specified
stacked.unstack(1)
| first |   | one | two |
|---|---|---|---|
| bar | A | 0.072026 | -1.099181 |
| bar | B | 0.422077 | -0.354796 |
| baz | A | 1.285500 | 0.645316 |
| baz | B | -1.185525 | -0.660115 |
| foo | A | 0.696443 | 0.718399 |
| foo | B | -1.664527 | -0.154125 |
| qux | A | -0.740052 | -0.672748 |
| qux | B | 0.713089 | -1.346843 |
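A compact sketch of the stack/unstack round trip on a two-row MultiIndex frame (the labels are illustrative):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('bar', 'one'), ('bar', 'two')],
                                names=['first', 'second'])
df = pd.DataFrame({'A': [1, 2]}, index=idx)

stacked = df.stack()           # columns move into the innermost index level
roundtrip = stacked.unstack()  # unstacking the innermost level restores the columns
```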
Pivot tables
A pivot table summarizes the data: it groups rows by chosen keys and spreads another key across the columns.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | one | A | foo | 1.477533 | 1.557713 |
| 1 | one | B | foo | 0.019528 | 2.483014 |
| 2 | two | C | foo | -0.912452 | 0.409732 |
| 3 | three | A | bar | 0.502807 | -0.462401 |
| 4 | one | B | bar | 1.709597 | -1.739413 |
| 5 | one | C | bar | -0.658155 | 1.302735 |
| 6 | two | A | foo | 0.007806 | 0.782926 |
| 7 | three | B | foo | -0.067922 | -0.193820 |
| 8 | one | C | foo | 0.806713 | 0.383870 |
| 9 | one | A | bar | 0.794017 | 0.749756 |
| 10 | two | B | bar | -0.532554 | -0.811900 |
| 11 | three | C | bar | 0.464731 | 1.168423 |
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
| A | B | bar | foo |
|---|---|---|---|
| one | A | 0.794017 | 1.477533 |
| one | B | 1.709597 | 0.019528 |
| one | C | -0.658155 | 0.806713 |
| three | A | 0.502807 | NaN |
| three | B | NaN | -0.067922 |
| three | C | 0.464731 | NaN |
| two | A | NaN | 0.007806 |
| two | B | -0.532554 | NaN |
| two | C | NaN | -0.912452 |
**When several rows map to the same cell, the pivot table stores their mean (the default aggregation).**
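A sketch of that averaging behavior: two rows share one cell, and `pivot_table` stores their mean because `aggfunc` defaults to `'mean'` (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': ['foo', 'foo', 'foo'],
                   'D': [1.0, 3.0, 5.0]})
# rows 0 and 1 both land in the (x, foo) cell, so it holds (1 + 3) / 2
pt = pd.pivot_table(df, values='D', index='A', columns='C')
```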
Time series
rng = pd.date_range('20160301', periods=600, freq='s')
rng
DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01',
               '2016-03-01 00:00:02', '2016-03-01 00:00:03',
               '2016-03-01 00:00:04', '2016-03-01 00:00:05',
               '2016-03-01 00:00:06', '2016-03-01 00:00:07',
               '2016-03-01 00:00:08', '2016-03-01 00:00:09',
               ...
               '2016-03-01 00:09:50', '2016-03-01 00:09:51',
               '2016-03-01 00:09:52', '2016-03-01 00:09:53',
               '2016-03-01 00:09:54', '2016-03-01 00:09:55',
               '2016-03-01 00:09:56', '2016-03-01 00:09:57',
               '2016-03-01 00:09:58', '2016-03-01 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2016-03-01 00:00:00 34
2016-03-01 00:00:01 4
2016-03-01 00:00:02 382
2016-03-01 00:00:03 164
2016-03-01 00:00:04 178
2016-03-01 00:00:05 421
2016-03-01 00:00:06 34
2016-03-01 00:00:07 71
2016-03-01 00:00:08 316
2016-03-01 00:00:09 201
2016-03-01 00:00:10 214
2016-03-01 00:00:11 443
2016-03-01 00:00:12 185
2016-03-01 00:00:13 79
2016-03-01 00:00:14 38
2016-03-01 00:00:15 465
2016-03-01 00:00:16 309
2016-03-01 00:00:17 93
2016-03-01 00:00:18 20
2016-03-01 00:00:19 338
2016-03-01 00:00:20 149
2016-03-01 00:00:21 34
2016-03-01 00:00:22 257
2016-03-01 00:00:23 462
2016-03-01 00:00:24 41
2016-03-01 00:00:25 471
2016-03-01 00:00:26 313
2016-03-01 00:00:27 224
2016-03-01 00:00:28 78
2016-03-01 00:00:29 498
...
2016-03-01 00:09:30 61
2016-03-01 00:09:31 315
2016-03-01 00:09:32 388
2016-03-01 00:09:33 391
2016-03-01 00:09:34 263
2016-03-01 00:09:35 11
2016-03-01 00:09:36 61
2016-03-01 00:09:37 400
2016-03-01 00:09:38 109
2016-03-01 00:09:39 135
2016-03-01 00:09:40 267
2016-03-01 00:09:41 248
2016-03-01 00:09:42 469
2016-03-01 00:09:43 155
2016-03-01 00:09:44 284
2016-03-01 00:09:45 168
2016-03-01 00:09:46 228
2016-03-01 00:09:47 244
2016-03-01 00:09:48 442
2016-03-01 00:09:49 450
2016-03-01 00:09:50 226
2016-03-01 00:09:51 370
2016-03-01 00:09:52 192
2016-03-01 00:09:53 325
2016-03-01 00:09:54 82
2016-03-01 00:09:55 154
2016-03-01 00:09:56 285
2016-03-01 00:09:57 22
2016-03-01 00:09:58 48
2016-03-01 00:09:59 171
Freq: S, dtype: int32
# resample into 2-minute bins (older pandas spelled this resample('2Min', how='sum'))
ts.resample('2Min').sum()
2016-03-01 00:00:00 28595
2016-03-01 00:02:00 29339
2016-03-01 00:04:00 28991
2016-03-01 00:06:00 30789
2016-03-01 00:08:00 30131
Freq: 2T, dtype: int32
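A deterministic sketch of resampling: six one-second observations collapse into two three-second bins (the values are illustrative):

```python
import pandas as pd

rng = pd.date_range('20160301', periods=6, freq='s')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=rng)

binned = ts.resample('3s').sum()  # sums within each 3-second window
```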
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='int64', freq='Q-NOV')
prng.to_timestamp()  # convert periods to timestamps (year-month-day)
pd.Timestamp('20160301') - pd.Timestamp('2016')  # difference between two timestamps
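Timestamp subtraction yields a Timedelta. Here `'2016'` parses as 2016-01-01, so the gap to March 1 is 31 + 29 = 60 days (2016 is a leap year):

```python
import pandas as pd

delta = pd.Timestamp('20160301') - pd.Timestamp('2016')  # '2016' -> 2016-01-01
```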
Categorical data
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
|   | id | raw_grade |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | b |
| 3 | 4 | a |
| 4 | 5 | a |
| 5 | 6 | e |
df["grade"] = df["raw_grade"].astype("category")
df
|   | id | raw_grade | grade |
|---|---|---|---|
| 0 | 1 | a | a |
| 1 | 2 | b | b |
| 2 | 3 | b | b |
| 3 | 4 | a | a |
| 4 | 5 | a | a |
| 5 | 6 | e | e |
df["grade"].cat.categories
Index([u'a', u'b', u'e'], dtype='object')
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df
|   | id | raw_grade | grade |
|---|---|---|---|
| 0 | 1 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 5 | 6 | e | very bad |
df.sort_values(by='grade', ascending=True)
|   | id | raw_grade | grade |
|---|---|---|---|
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 5 | 6 | e | very bad |
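The sort above follows category order, not alphabetical order of the strings. A self-contained sketch (`rename_categories` is the current API; older pandas allowed assigning to `.cat.categories` directly):

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'e']).astype('category')
# categories are ['a', 'b', 'e']; rename them positionally
s = s.cat.rename_categories(['very good', 'good', 'very bad'])

# sorting a categorical follows the category order: very good < good < very bad
ordered = list(s.sort_values())
```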
Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts
2000-01-01 0.416424
2000-01-02 0.603304
2000-01-03 -0.237965
2000-01-04 0.317450
2000-01-05 0.665045
2000-01-06 2.468087
2000-01-07 2.758852
2000-01-08 2.271343
2000-01-09 3.129609
2000-01-10 5.171241
2000-01-11 5.049896
2000-01-12 5.185316
2000-01-13 4.169058
2000-01-14 2.862306
2000-01-15 4.018617
2000-01-16 4.456694
2000-01-17 5.824236
2000-01-18 6.094983
2000-01-19 5.880954
2000-01-20 5.875111
2000-01-21 6.008481
2000-01-22 6.835501
2000-01-23 7.480405
2000-01-24 6.849335
2000-01-25 7.608887
2000-01-26 9.029474
2000-01-27 8.859222
2000-01-28 7.162806
2000-01-29 7.398013
2000-01-30 7.391844
...
2002-08-28 21.728409
2002-08-29 21.757852
2002-08-30 21.047643
2002-08-31 20.114996
2002-09-01 18.769902
2002-09-02 17.417680
2002-09-03 17.917688
2002-09-04 18.064786
2002-09-05 19.312356
2002-09-06 18.633479
2002-09-07 17.711879
2002-09-08 19.162369
2002-09-09 19.697896
2002-09-10 18.895018
2002-09-11 18.590989
2002-09-12 17.278925
2002-09-13 17.730168
2002-09-14 19.058526
2002-09-15 18.898382
2002-09-16 17.048621
2002-09-17 16.443233
2002-09-18 16.842284
2002-09-19 14.627031
2002-09-20 15.500982
2002-09-21 14.640444
2002-09-22 13.183795
2002-09-23 13.383657
2002-09-24 13.006229
2002-09-25 12.311008
2002-09-26 11.674804
Freq: D, dtype: float64
ts.plot()
Reading and writing data
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.052421 | -0.164992 | 3.098604 | -0.966960 |
| 1 | 1.194177 | 0.086880 | 0.496095 | 0.265308 |
| 2 | 0.297724 | 1.284297 | -0.130855 | -0.229570 |
| 3 | -0.787063 | 0.553680 | 0.546853 | -0.322599 |
| 4 | 0.033174 | -1.222281 | 0.320090 | -1.749333 |
| 5 | 0.109575 | 0.310684 | 1.620296 | -0.928869 |
| 6 | 0.761408 | -0.027630 | 0.458341 | -0.785370 |
| 7 | -1.150479 | -0.718584 | 1.028866 | 0.419026 |
| 8 | -2.906881 | -0.295700 | -0.342306 | -0.765172 |
| 9 | 0.916363 | -1.181429 | -1.559657 | -1.171191 |
| 10 | 0.578659 | 0.804726 | 1.299496 | 0.176843 |
| 11 | 0.150659 | -0.162833 | -1.086055 | 1.240432 |
| 12 | -0.819219 | 1.668234 | 0.217604 | -0.779170 |
| 13 | -0.550658 | -0.672640 | -0.674157 | -0.637602 |
| 14 | 0.901584 | 0.046023 | 0.244370 | 0.374293 |
| 15 | 0.971181 | -0.442618 | 0.179083 | 0.086095 |
| 16 | -0.570786 | -1.019239 | 1.684833 | 0.539140 |
| 17 | -1.432314 | 1.369588 | 2.091300 | 0.733526 |
| 18 | -1.115526 | -0.115884 | 2.636074 | -0.788859 |
| 19 | 1.601554 | 1.226182 | 0.169308 | -0.616585 |
| 20 | 0.571316 | 0.542432 | 0.306595 | 0.780939 |
| 21 | -0.540414 | 1.036656 | 0.683224 | -0.116963 |
| 22 | 1.319110 | -1.265207 | 1.371924 | 0.881560 |
| 23 | 1.584346 | -1.719633 | -1.365020 | -0.617224 |
| 24 | -0.440420 | -0.799265 | 0.376128 | -0.654581 |
| 25 | -0.261730 | -0.046325 | -0.289009 | 0.505634 |
| 26 | 0.385047 | 0.112723 | 0.428345 | -0.008455 |
| 27 | -0.921668 | 1.609848 | 1.592532 | -0.623103 |
| 28 | 0.280799 | -0.231821 | -1.589829 | -1.791286 |
| 29 | 0.661562 | 0.621305 | 0.921586 | -0.312834 |
| ... | ... | ... | ... | ... |
| 70 | 0.064385 | 0.669585 | -1.347073 | 0.941348 |
| 71 | -1.534420 | -1.227736 | 0.459771 | -1.150254 |
| 72 | 0.010741 | 0.062820 | -1.098301 | 1.268482 |
| 73 | -1.183586 | 1.159889 | -0.186617 | -0.847210 |
| 74 | -0.705815 | -0.371896 | 0.313020 | 0.035314 |
| 75 | -2.945315 | -0.421227 | -0.403479 | 1.387825 |
| 76 | -0.122383 | 0.474282 | -2.039155 | -0.155960 |
| 77 | 0.921353 | -0.430436 | -0.599253 | 0.911030 |
| 78 | 0.018444 | 0.098611 | 0.320480 | 0.001282 |
| 79 | -0.188301 | -2.015690 | -0.427172 | -0.146939 |
| 80 | -0.006022 | 0.213421 | 1.358382 | -0.414890 |
| 81 | 0.596546 | 0.042708 | 1.325342 | -0.800222 |
| 82 | -1.736245 | -0.056213 | -0.415892 | -0.360570 |
| 83 | 0.463591 | -0.404202 | 0.577191 | 0.336023 |
| 84 | -1.397557 | 0.442012 | 0.007915 | -1.305628 |
| 85 | -0.137766 | -0.771713 | 0.200956 | -0.365344 |
| 86 | 0.988833 | -0.165965 | -0.893573 | -0.318324 |
| 87 | 1.093799 | 1.694406 | -0.868420 | 0.100202 |
| 88 | -0.240628 | 0.539268 | -1.094841 | 1.737569 |
| 89 | 1.850923 | -0.472270 | -2.317345 | -0.544395 |
| 90 | 0.617284 | 1.224130 | -1.722366 | 0.236574 |
| 91 | 1.282967 | 0.738570 | 1.748848 | -0.106646 |
| 92 | 0.775707 | -0.494293 | -1.098466 | 0.372206 |
| 93 | -0.846466 | 0.735144 | 1.456520 | 1.622817 |
| 94 | -0.860999 | 1.146650 | -1.064013 | 1.400919 |
| 95 | -0.095498 | -1.849518 | 2.303532 | 0.688425 |
| 96 | -0.017921 | -0.558700 | -1.061605 | 0.781250 |
| 97 | -1.069070 | 1.106837 | -1.936800 | -0.782616 |
| 98 | 0.436267 | 0.463537 | 0.614982 | -0.123774 |
| 99 | -1.440635 | -1.506836 | -0.386824 | 1.118260 |

100 rows × 4 columns
df.to_csv('data.csv')
%ls
%more data.csv
# pd.read_csv('data.csv')
pd.read_csv('data.csv', index_col=0)  # use the first column as the index
Movie data analysis
Preparation
Download the MovieLens 1M Dataset from grouplens.org/datasets/movielens.
Data description
See the dataset's README.txt for details.
Analyzing movie ratings with Pandas
- Read the data
- Merge the tables
- Compute each movie's average rating
- Find the most active movies -> the more ratings a movie receives, the more active it is
- Movies most liked by women
- Movies most liked by men
- Movies with the largest gender rating gap -> films that women like but men don't
- The most controversial movies -> largest rating variance
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=user_names, engine='python')

rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')

movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')
pandas ships two parser engines, one written in C and one in Python. The Python engine is specified here because it supports features the C engine does not, such as the multi-character separator '::'.
print(len(users))
users.head(5)
6040
|   | user_id | gender | age | occupation | zip |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
Merging the tables
data = pd.merge(pd.merge(users, ratings), movies)
len(data)
data.head(5)
|   | user_id | gender | age | occupation | zip | movie_id | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 | 1193 | 5 | 978300760 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 2 | M | 56 | 16 | 70072 | 1193 | 5 | 978298413 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 2 | 12 | M | 25 | 12 | 32793 | 1193 | 4 | 978220179 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 3 | 15 | M | 25 | 7 | 22903 | 1193 | 4 | 978199279 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 4 | 17 | M | 50 | 1 | 95350 | 1193 | 5 | 978158471 | One Flew Over the Cuckoo's Nest (1975) | Drama |
data[data.user_id == 1]  # all ratings given by user 1
# average rating of each movie, split by gender
mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings_gender.head(5)
| title | F | M |
|---|---|---|
| $1,000,000 Duck (1971) | 3.375000 | 2.761905 |
| 'Night Mother (1986) | 3.388889 | 3.352941 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 |
| 'burbs, The (1989) | 2.793478 | 2.962085 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 |
# movies where men's and women's opinions differ most -> clashes in values/taste
mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M
mean_ratings_gender.head(5)
| title | F | M | diff |
|---|---|---|---|
| $1,000,000 Duck (1971) | 3.375000 | 2.761905 | 0.613095 |
| 'Night Mother (1986) | 3.388889 | 3.352941 | 0.035948 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 | -0.057658 |
| 'burbs, The (1989) | 2.793478 | 2.962085 | -0.168607 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 | 0.139547 |
mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)
| title | F | M | diff |
|---|---|---|---|
| Tigrero: A Film That Was Never Made (1994) | 1 | 4.333333 | -3.333333 |
| Neon Bible, The (1995) | 1 | 4.000000 | -3.000000 |
| Enfer, L' (1994) | 1 | 3.750000 | -2.750000 |
| Stalingrad (1993) | 1 | 3.593750 | -2.593750 |
| Killer: A Journal of Murder (1995) | 1 | 3.428571 | -2.428571 |
| Dangerous Ground (1997) | 1 | 3.333333 | -2.333333 |
| In God's Hands (1998) | 1 | 3.333333 | -2.333333 |
| Rosie (1998) | 1 | 3.333333 | -2.333333 |
| Flying Saucer, The (1950) | 1 | 3.300000 | -2.300000 |
| Jamaica Inn (1939) | 1 | 3.142857 | -2.142857 |
# ranking movies by activity (number of ratings)
ratings_by_movie_title = data.groupby('title').size()
ratings_by_movie_title.head(5)
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
# top 20 highest-rated movies -> the highest average ratings
mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
top_20_mean_ratings = mean_ratings.sort_values(ascending=False).head(20)
top_20_mean_ratings
title
Gate of Heavenly Peace, The (1995) 5.000000
Lured (1947) 5.000000
Ulysses (Ulisse) (1954) 5.000000
Smashing Time (1967) 5.000000
Follow the Bitch (1998) 5.000000
Song of Freedom (1936) 5.000000
Bittersweet Motel (2000) 5.000000
Baby, The (1973) 5.000000
One Little Indian (1973) 5.000000
Schlafes Bruder (Brother of Sleep) (1995) 5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 4.800000
Lamerica (1994) 4.750000
Apple, The (Sib) (1998) 4.666667
Sanjuro (1962) 4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.560510
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Close Shave, A (1995) 4.520548
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Name: rating, dtype: float64
# average ratings of the ten most active movies -> popularity does not guarantee a high score
top_10_ratings = ratings_by_movie_title.sort_values(ascending=False).head(10)
mean_ratings[top_10_ratings.index]
title
American Beauty (1999) 4.317386
Star Wars: Episode IV - A New Hope (1977) 4.453694
Star Wars: Episode V - The Empire Strikes Back (1980) 4.292977
Star Wars: Episode VI - Return of the Jedi (1983) 4.022893
Jurassic Park (1993) 3.763847
Saving Private Ryan (1998) 4.337354
Terminator 2: Judgment Day (1991) 4.058513
Matrix, The (1999) 4.315830
Back to the Future (1985) 3.990321
Silence of the Lambs, The (1991) 4.351823
Name: rating, dtype: float64
# activity of the top 20 highest-rated movies -> a high score does not imply many viewers; a niche film may be seen by few people yet rated very highly
ratings_by_movie_title[top_20_mean_ratings.index]
title
Gate of Heavenly Peace, The (1995) 3
Lured (1947) 1
Ulysses (Ulisse) (1954) 1
Smashing Time (1967) 2
Follow the Bitch (1998) 1
Song of Freedom (1936) 1
Bittersweet Motel (2000) 1
Baby, The (1973) 1
One Little Indian (1973) 1
Schlafes Bruder (Brother of Sleep) (1995) 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 5
Lamerica (1994) 8
Apple, The (Sib) (1998) 9
Sanjuro (1962) 69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 628
Shawshank Redemption, The (1994) 2227
Godfather, The (1972) 2223
Close Shave, A (1995) 657
Usual Suspects, The (1995) 1783
Schindler's List (1993) 2304
dtype: int64
# ten great movies -> high-rated movies with more than 1000 ratings
top_ratings = ratings_by_movie_title[ratings_by_movie_title > 1000]
top_10_movies = mean_ratings[top_ratings.index].sort_values(ascending=False).head(10)
top_10_movies
title
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Raiders of the Lost Ark (1981) 4.477725
Rear Window (1954) 4.476190
Star Wars: Episode IV - A New Hope (1977) 4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) 4.449890
Casablanca (1942) 4.412822
Sixth Sense, The (1999) 4.406263
Name: rating, dtype: float64
Pandas core data structures
import pandas as pd
import numpy as np
Series
A Series is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, Python objects). The basic constructor is:
s = pd.Series(data, index=index)
where index is a list used to label the data. data can be one of several types:
- a Python dict
- an ndarray
- a scalar value, such as 5
Creating from an ndarray
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a 0.747292
b -1.120276
c -0.132692
d -0.267813
e -0.590904
dtype: float64
s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
s = pd.Series(np.random.randn(5))
s
0 0.324214
1 -0.183776
2 -0.518808
3 0.866421
4 -0.601668
dtype: float64
s.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')
Creating from a dict
# labels missing from the dict default to NaN
d = {'a' : 0., 'b' : 1., 'd' : 3}
s = pd.Series(d, index=list('abcd'))
s
a 0
b 1
c NaN
d 3
dtype: float64
Creating from a scalar
pd.Series(3, index=list('abcde'))
a 3
b 3
c 3
d 3
e 3
dtype: int64
Series characteristics
A Series behaves like an ndarray
Anyone familiar with numpy will recognize the operations below; this indexing style was also covered in the numpy introduction.
s = pd.Series(np.random.randn(5))
s
0 0.882069
1 -0.134360
2 -0.925088
3 0.191072
4 2.546704
dtype: float64
s[0]
0.88206876023157332
s[:3]
0 0.882069
1 -0.134360
2 -0.925088
dtype: float64
s[[1, 3, 4]]
1 -0.134360
3 0.191072
4 2.546704
dtype: float64
np.exp(s)
0 2.415892
1 0.874275
2 0.396497
3 1.210546
4 12.764963
dtype: float64
np.sin(s)
0 0.772055
1 -0.133957
2 -0.798673
3 0.189911
4 0.560416
dtype: float64
A Series behaves like a dict
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a -2.149840
b -0.924115
c 0.481231
d 1.033813
e -0.462794
dtype: float64
s['a']
-2.1498403551053218
s['e'] = 5
s
a -2.149840
b -0.924115
c 0.481231
d 1.033813
e 5.000000
dtype: float64
s['g'] = 100
s
a -2.149840
b -0.924115
c 0.481231
d 1.033813
e 5.000000
g 100.000000
dtype: float64
'e' in s
True
'f' in s
False
# s['f']  # would raise a KeyError
print(s.get('f'))
None
print(s.get('f', np.nan))
nan
Label alignment
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print('{0}\n\n{1}'.format(s1, s2))
a   -0.917905
c   -0.744616
e    0.114522
dtype: float64

a    0.721087
d   -0.471575
e    0.796093
dtype: float64
s1 + s2
a   -0.196818
c         NaN
d         NaN
e    0.910615
dtype: float64
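The alignment rule can be sketched with fixed values: labels present in only one operand yield NaN (the numbers here are illustrative):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

total = s1 + s2  # values align by label; 'c' and 'd' have no partner
```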
The name attribute
s = pd.Series(np.random.randn(5), name='Some Thing')
s
0 0.623787
1 0.517239
2 1.551314
3 1.414463
4 -1.224611
Name: Some Thing, dtype: float64
s.name
'Some Thing'
DataFrame
A DataFrame is a two-dimensional array with both row and column labels. You can think of a DataFrame as an Excel sheet, a SQL table, or a dict of Series objects. It is the most commonly used data structure in Pandas.
The basic constructor is:
df = pd.DataFrame(data, index=index, columns=columns)
where index supplies the row labels and columns the column labels. data can be:
- a dict of one-dimensional numpy arrays, lists, or Series
- a two-dimensional numpy array
- a Series
- another DataFrame
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
|   | one | two |
|---|---|---|
| a | 1 | 1 |
| b | 2 | 2 |
| c | 3 | 3 |
| d | NaN | 4 |
pd.DataFrame(d, index=['d', 'b', 'a'])
|   | one | two |
|---|---|---|
| d | NaN | 4 |
| b | 2 | 2 |
| a | 1 | 1 |
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
|   | two | three |
|---|---|---|
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |
If the values are lists, they must all have the same length; otherwise an error is raised.
d = {'one': [1, 2, 3, 4], 'two': [21, 22, 23, 24]}
pd.DataFrame(d)
|   | one | two |
|---|---|---|
| 0 | 1 | 21 |
| 1 | 2 | 22 |
| 2 | 3 | 23 |
| 3 | 4 | 24 |
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
|   | one | two |
|---|---|---|
| a | 1 | 21 |
| b | 2 | 22 |
| c | 3 | 23 |
| d | 4 | 24 |
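The length requirement mentioned above can be verified directly: mismatched list lengths make the constructor raise a ValueError:

```python
import pandas as pd

try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [1, 2]})  # lengths 3 and 2
    raised = False
except ValueError:
    raised = True  # pandas rejects columns of unequal length
```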
Creating from structured data
data = [(1, 2.2, 'Hello'), (2, 3., "World")]
pd.DataFrame(data)
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2.2 | Hello |
| 1 | 2 | 3.0 | World |
pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
|   | A | B | C |
|---|---|---|---|
| first | 1 | 2.2 | Hello |
| second | 2 | 3.0 | World |
Creating from a list of dicts
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data)
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20 |
pd.DataFrame(data, index=['first', 'second'])
|   | a | b | c |
|---|---|---|---|
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20 |
pd.DataFrame(data, columns=['a', 'b'])
Creating from a dict of tuples
This shows how such construction works in principle. In practice you would clean the raw data into a format that Pandas imports easily and that stays readable, and only then build complex structures via reindex, groupby, and similar operations.
d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
     ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
# multi-level labels
pd.DataFrame(d)
|   |   | a, a | a, b | a, c | b, a | b, b |
|---|---|---|---|---|---|---|
| A | B | 4 | 1 | 5 | 8 | 10 |
| A | C | 3 | 2 | 6 | 7 | NaN |
| A | D | NaN | NaN | NaN | NaN | 9 |
Creating from a Series
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)
|   | 0 |
|---|---|
| a | -0.789343 |
| b | 0.127384 |
| c | 1.084005 |
| d | -0.755011 |
| e | -0.963299 |
pd.DataFrame(s, index=['a', 'c', 'd'])
|   | 0 |
|---|---|
| a | -0.789343 |
| c | 1.084005 |
| d | -0.755011 |
Specifying more than one column here raises an error, unlike creating from a dict.
pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])
|   | A |
|---|---|
| a | -0.789343 |
| c | 1.084005 |
| d | -0.755011 |
Column selection, addition, and deletion
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df
|   | one | two | three | four |
|---|---|---|---|---|
| 0 | 2.045300 | -0.981722 | -0.656081 | -0.639517 |
| 1 | -0.550780 | 0.248781 | -0.146424 | 0.217392 |
| 2 | 1.702775 | 0.103998 | -0.662138 | -0.534071 |
| 3 | -2.035681 | 0.015025 | 1.368209 | 0.178378 |
| 4 | -1.092208 | 0.091108 | -0.892496 | -0.611198 |
| 5 | 0.093502 | 0.267428 | 1.189654 | -0.258723 |
df['one']
0 2.045300
1 -0.550780
2 1.702775
3 -2.035681
4 -1.092208
5 0.093502
Name: one, dtype: float64
df['three'] = df['one'] + df['two']
df
|   | one | two | three | four |
|---|---|---|---|---|
| 0 | 2.045300 | -0.981722 | 1.063578 | -0.639517 |
| 1 | -0.550780 | 0.248781 | -0.301999 | 0.217392 |
| 2 | 1.702775 | 0.103998 | 1.806773 | -0.534071 |
| 3 | -2.035681 | 0.015025 | -2.020656 | 0.178378 |
| 4 | -1.092208 | 0.091108 | -1.001100 | -0.611198 |
| 5 | 0.093502 | 0.267428 | 0.360931 | -0.258723 |
df['flag'] = df['one'] > 0
df
|   | one | two | three | four | flag |
|---|---|---|---|---|---|
| 0 | 2.045300 | -0.981722 | 1.063578 | -0.639517 | True |
| 1 | -0.550780 | 0.248781 | -0.301999 | 0.217392 | False |
| 2 | 1.702775 | 0.103998 | 1.806773 | -0.534071 | True |
| 3 | -2.035681 | 0.015025 | -2.020656 | 0.178378 | False |
| 4 | -1.092208 | 0.091108 | -1.001100 | -0.611198 | False |
| 5 | 0.093502 | 0.267428 | 0.360931 | -0.258723 | True |
del df['three']
df
|   | one | two | four | flag |
|---|---|---|---|---|
| 0 | 2.045300 | -0.981722 | -0.639517 | True |
| 1 | -0.550780 | 0.248781 | 0.217392 | False |
| 2 | 1.702775 | 0.103998 | -0.534071 | True |
| 3 | -2.035681 | 0.015025 | 0.178378 | False |
| 4 | -1.092208 | 0.091108 | -0.611198 | False |
| 5 | 0.093502 | 0.267428 | -0.258723 | True |
four = df.pop('four')
four
0 -0.639517
1 0.217392
2 -0.534071
3 0.178378
4 -0.611198
5 -0.258723
Name: four, dtype: float64
df
|   | one | two | flag |
|---|---|---|---|
| 0 | 2.045300 | -0.981722 | True |
| 1 | -0.550780 | 0.248781 | False |
| 2 | 1.702775 | 0.103998 | True |
| 3 | -2.035681 | 0.015025 | False |
| 4 | -1.092208 | 0.091108 | False |
| 5 | 0.093502 | 0.267428 | True |
df['five'] = 5
df
        one       two   flag  five
0  2.045300 -0.981722   True     5
1 -0.550780  0.248781  False     5
2  1.702775  0.103998   True     5
3 -2.035681  0.015025  False     5
4 -1.092208  0.091108  False     5
5  0.093502  0.267428   True     5
df['one_trunc'] = df['one'][:2]
df
        one       two   flag  five  one_trunc
0  2.045300 -0.981722   True     5    2.04530
1 -0.550780  0.248781  False     5   -0.55078
2  1.702775  0.103998   True     5        NaN
3 -2.035681  0.015025  False     5        NaN
4 -1.092208  0.091108  False     5        NaN
5  0.093502  0.267428   True     5        NaN
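The NaN values above come from index alignment: assigning a shorter Series fills the non-matching row labels with NaN. A minimal sketch of this behavior:

```python
import pandas as pd

# Assigning a shorter Series aligns on the index;
# rows with no matching label become NaN.
df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0]})
df['one_trunc'] = df['one'][:2]
print(df['one_trunc'].isna().tolist())  # [False, False, True, True]
```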
# insert at a specified position
df.insert(1, 'bar', df['one'])
df
        one       bar       two   flag  five  one_trunc
0  2.045300  2.045300 -0.981722   True     5    2.04530
1 -0.550780 -0.550780  0.248781  False     5   -0.55078
2  1.702775  1.702775  0.103998   True     5        NaN
3 -2.035681 -2.035681  0.015025  False     5        NaN
4 -1.092208 -1.092208  0.091108  False     5        NaN
5  0.093502  0.093502  0.267428   True     5        NaN
Inserting new columns with assign()
assign() is handy for building method chains.
df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
df
   A  B  C  D
0  4  3  3  4
1  1  4  4  2
2  1  4  4  3
3  2  4  4  3
4  1  2  4  2
5  3  4  1  4
df.assign(Ratio = df['A'] / df['B'])
   A  B  C  D     Ratio
0  4  3  3  4  1.333333
1  1  4  4  2  0.250000
2  1  4  4  3  0.250000
3  2  4  4  3  0.500000
4  1  2  4  2  0.500000
5  3  4  1  4  0.750000
insert modifies the original DataFrame in place, whereas assign returns a new one. assign can also take a function to compute the new column.
df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)
   A  B  C  D  AB_Ratio  CD_Ratio
0  4  3  3  4  1.333333        -1
1  1  4  4  2  0.250000         2
2  1  4  4  3  0.250000         1
3  2  4  4  3  0.500000         1
4  1  2  4  2  0.500000         2
5  3  4  1  4  0.750000        -3
df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)
   A  B  C  D  AB_Ratio  ABD_Ratio
0  4  3  3  4  1.333333   5.333333
1  1  4  4  2  0.250000   0.500000
2  1  4  4  3  0.250000   0.750000
3  2  4  4  3  0.500000   1.500000
4  1  2  4  2  0.500000   1.000000
5  3  4  1  4  0.750000   3.000000
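Chaining a second assign() works because each call returns a new DataFrame. A sketch of the shorter alternative: since pandas 0.23 (on Python 3.6+), a later keyword in the same assign() call may reference a column created by an earlier one.

```python
import pandas as pd

# Dependent assigns in a single call (pandas >= 0.23):
# ABD_Ratio refers to AB_Ratio, created just before it.
df = pd.DataFrame({'A': [4, 1], 'B': [3, 4], 'D': [4, 2]})
out = df.assign(AB_Ratio=lambda x: x.A / x.B,
                ABD_Ratio=lambda x: x.AB_Ratio * x.D)
print(out['ABD_Ratio'].tolist())
```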
Indexing and selection
The operations, their syntax, and what they return:
- Select a column -> df[col] -> Series
- Select a row by label -> df.loc[label] -> Series
- Select a row by integer position -> df.iloc[pos] -> Series
- Select multiple rows -> df[5:10] -> DataFrame
- Select rows by a boolean vector -> df[bool_vector] -> DataFrame
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df
   A  B  C  D
a  2  2  6  6
b  8  3  5  7
c  4  6  8  3
d  7  8  3  9
e  8  4  4  2
f  4  2  4  3
df['A']
a 2
b 8
c 4
d 7
e 8
f 4
Name: A, dtype: int32
df.loc['a']
A 2
B 2
C 6
D 6
Name: a, dtype: int32
df.iloc[0]
A 2
B 2
C 6
D 6
Name: a, dtype: int32
df[1:4]
   A  B  C  D
b  8  3  5  7
c  4  6  8  3
d  7  8  3  9
df.iloc[1:4] is more efficient than slicing the DataFrame directly like this.
df[[False, True, True, False, True, False]]
   A  B  C  D
b  8  3  5  7
c  4  6  8  3
e  8  4  4  2
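In practice the boolean vector is rarely written out by hand; it is usually the result of a condition on the data. A small sketch:

```python
import pandas as pd

# A comparison on a column yields a boolean Series,
# which then selects the matching rows.
df = pd.DataFrame({'A': [2, 8, 4], 'B': [2, 3, 6]}, index=list('abc'))
print(df[df['A'] > 3].index.tolist())  # ['b', 'c']
```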
Data alignment
When computing with two DataFrames, pandas automatically aligns the data by both row and column labels; the result carries the union of the two DataFrames' labels.
df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1
          A         B         C         D
a  0.576428 -0.037913 -0.329787 -1.752916
b  0.406743 -1.044561 -0.724447  0.374599
c  0.073578  0.423914 -1.499770 -0.488374
d -0.377609  1.137422 -1.951169 -0.814306
e -2.171648 -2.364502 -0.833594  0.168636
f -1.134800 -0.927469  0.886889  0.542603
g  0.625104  0.115953 -1.282609  1.031292
h  0.403509  0.263207  0.403614 -0.177888
i  0.148494 -2.034253  0.134859 -0.960650
j  0.094200 -1.803288  0.057472 -0.338958
df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2
          A         B         C
c  0.884518  0.337344 -1.072027
d  0.264036 -0.152542 -0.225544
e  1.048813 -1.496442  1.022348
f  0.895314 -0.890236  1.230465
g -0.588162 -0.492354 -0.739563
h -2.580322  1.104810 -0.167137
i -0.842738  0.171735  0.847714
df1 + df2
          A         B         C   D
a       NaN       NaN       NaN NaN
b       NaN       NaN       NaN NaN
c  0.958096  0.761259 -2.571797 NaN
d -0.113573  0.984880 -2.176713 NaN
e -1.122834 -3.860944  0.188754 NaN
f -0.239486 -1.817705  2.117354 NaN
g  0.036942 -0.376401 -2.022171 NaN
h -2.176813  1.368016  0.236476 NaN
i -0.694245 -1.862517  0.982573 NaN
j       NaN       NaN       NaN NaN
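Labels present on only one side become NaN, as the table shows. To avoid that, the add() method takes a fill_value, which substitutes for the missing side before the operation. A minimal sketch:

```python
import pandas as pd

# add() with fill_value: a label missing on one side is
# treated as the given value instead of producing NaN.
df1 = pd.DataFrame({'A': [1.0, 2.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'A': [10.0]}, index=['b'])
out = df1.add(df2, fill_value=0)
print(out['A'].tolist())  # [1.0, 12.0]
```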
df1 - df1.iloc[0]
          A         B         C         D
a  0.000000  0.000000  0.000000  0.000000
b -0.169685 -1.006648 -0.394660  2.127515
c -0.502850  0.461827 -1.169983  1.264541
d -0.954037  1.175335 -1.621382  0.938610
e -2.748076 -2.326589 -0.503807  1.921551
f -1.711228 -0.889556  1.216676  2.295518
g  0.048676  0.153866 -0.952822  2.784208
h -0.172919  0.301119  0.733400  1.575028
i -0.427934 -1.996340  0.464646  0.792265
j -0.482228 -1.765375  0.387259  1.413957
Here the two operands have different shapes, so broadcasting applies: the result is a new DataFrame in which row 0 has been subtracted from every row of the original.
Using numpy functions
Pandas and numpy are fully compatible at the level of their core data structures.
df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df
        one       two     three      four
0 -1.121818  1.233686  0.681618 -0.502204
1  1.469664 -0.060555 -0.044857  0.725021
2  1.219670  0.108709  1.806063  0.332685
3 -0.190615  1.244102 -0.863850  1.795335
4 -0.133109 -0.101591  0.818724  1.246230
5  0.729804  0.716593  2.472841 -0.078224
6  0.010136  1.725441 -1.071194  1.602945
7  1.002507 -1.122593 -0.147411 -1.678843
8 -0.550077  0.230777 -0.658470 -1.680395
9  1.006271  0.455683 -2.279833 -0.823792
np.exp(df)
        one       two      three      four
0  0.325687  3.433864   1.977073  0.605196
1  4.347774  0.941242   0.956134  2.064774
2  3.386069  1.114838   6.086440  1.394708
3  0.826450  3.469817   0.421536  6.021490
4  0.875369  0.903399   2.267604  3.477210
5  2.074675  2.047446  11.856082  0.924757
6  1.010187  5.614995   0.342599  4.967641
7  2.725105  0.325435   0.862939  0.186590
8  0.576905  1.259578   0.517643  0.186300
9  2.735382  1.577250   0.102301  0.438765
np.asarray(df) == df.values
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)
type(np.asarray(df))
numpy.ndarray
np.asarray(df) == df
    one   two  three  four
0  True  True   True  True
1  True  True   True  True
2  True  True   True  True
3  True  True   True  True
4  True  True   True  True
5  True  True   True  True
6  True  True   True  True
7  True  True   True  True
8  True  True   True  True
9  True  True   True  True
Panel
Panel is a three-dimensional labeled array. The name pandas in fact derives from it: pan(el)-da(ta)-s. Panel was always a niche structure, and it has since been deprecated and removed (as of pandas 1.0), so the examples below require an older pandas version.
- items: axis 0; each item maps to a DataFrame
- major_axis: axis 1; the row labels of each DataFrame
- minor_axis: axis 2; the column labels of each DataFrame
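On modern pandas, where Panel no longer exists, the usual replacement is a DataFrame with a MultiIndex. A sketch of building one from the same dict of DataFrames via pd.concat:

```python
import pandas as pd
import numpy as np

# pd.concat on a dict stacks the frames and uses the dict keys
# as the outer index level, mimicking the Panel items axis.
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
frame = pd.concat(data)
print(frame.index.nlevels)       # 2
print(frame.loc['Item1'].shape)  # (4, 3)
```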
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
pn['Item1']
          0         1         2
0  0.638298 -1.600822  3.112210
1  0.394099  0.184129  0.438450
2  0.427692 -0.294556  0.039430
3  1.555046  0.933749  0.218616
pn.items
Index([u'Item1', u'Item2'], dtype='object')
pn.major_axis
Int64Index([0, 1, 2, 3], dtype='int64')
pn.minor_axis
Int64Index([0, 1, 2], dtype='int64')
# cross-section: all values at one major_axis label
pn.major_xs(pn.major_axis[0])
      Item1     Item2
0  0.638298 -1.427579
1 -1.600822 -0.778090
2  3.112210       NaN
# cross-section: all values at one minor_axis label
pn.minor_xs(pn.minor_axis[1])
      Item1     Item2
0 -1.600822 -0.778090
1  0.184129  0.698347
2 -0.294556 -0.167423
3  0.933749  0.205092
pn.to_frame()
                Item1     Item2
major minor
0     0      0.638298 -1.427579
      1     -1.600822 -0.778090
1     0      0.394099 -0.999929
      1      0.184129  0.698347
2     0      0.427692  0.559905
      1     -0.294556 -0.167423
3     0      1.555046 -1.992102
      1      0.933749  0.205092
Rows containing NaN (here the minor label 2, which Item2 lacks) are dropped automatically.
Basic operations
import pandas as pd
import numpy as np
Reindexing
Series
s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))
s
a 1
c 3
e 5
f 6
h 8
dtype: int64
s.reindex(list('abcdefgh'))
a 1
b NaN
c 3
d NaN
e 5
f 6
g NaN
h 8
dtype: float64
s.reindex(list('abcdefgh'), fill_value=0)
a 1
b 0
c 3
d 0
e 5
f 6
g 0
h 8
dtype: int64
# method='bfill'
s.reindex(list('abcdefgh'), method='ffill')
a 1
b 1
c 3
d 3
e 5
f 6
g 6
h 8
dtype: int64
DataFrame
df = pd.DataFrame(np.random.randn(4, 6), index=list('ADFH'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df
        one       two     three      four      five       six
A -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
D -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
F  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
H -0.900819 -0.177751 -0.232796  0.234088 -1.758574  1.255955
df2 = df.reindex(index=list('ABCDEFGH'))
df2
A list passed positionally is taken as the new index by default; to reindex columns you must use the columns keyword.
        one       two     three      four      five       six
A -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
B       NaN       NaN       NaN       NaN       NaN       NaN
C       NaN       NaN       NaN       NaN       NaN       NaN
D -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
E       NaN       NaN       NaN       NaN       NaN       NaN
F  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
G       NaN       NaN       NaN       NaN       NaN       NaN
H -0.900819 -0.177751 -0.232796  0.234088 -1.758574  1.255955
reindex returns a copy of the data.
df.reindex(columns=['one', 'three', 'five', 'seven'])
        one     three      five  seven
A -0.049437  1.780662  2.434957    NaN
D -0.075226 -0.462732 -0.590041    NaN
F  1.486168  0.598127 -2.815687    NaN
H -0.900819 -0.232796 -1.758574    NaN
df.reindex(columns=['one', 'three', 'five', 'seven'], fill_value=0)
        one     three      five  seven
A -0.049437  1.780662  2.434957      0
D -0.075226 -0.462732 -0.590041      0
F  1.486168  0.598127 -2.815687      0
H -0.900819 -0.232796 -1.758574      0
# method-based filling only works along the index (rows)
df.reindex(columns=['one', 'three', 'five', 'seven'], method='ffill')
        one     three      five  seven
A -0.049437  1.780662  2.434957    NaN
D -0.075226 -0.462732 -0.590041    NaN
F  1.486168  0.598127 -2.815687    NaN
H -0.900819 -0.232796 -1.758574    NaN
df.reindex(index=list('ABCDEFGH'), method='ffill')
        one       two     three      four      five       six
A -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
B -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
C -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
D -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
E -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
F  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
G  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
H -0.900819 -0.177751 -0.232796  0.234088 -1.758574  1.255955
Dropping data
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df
|
one
|
two
|
three
|
four
|
five
|
six
|
A
|
-0.665415
|
-0.061367
|
0.075058
|
0.626415
|
-1.748458
|
-0.608540
|
B
|
-1.455186
|
1.846691
|
0.234276
|
0.660298
|
-2.169835
|
-1.476485
|
C
|
0.322281
|
0.505378
|
0.198458
|
-0.831919
|
-0.630789
|
0.762524
|
D
|
0.703684
|
-0.827597
|
0.178063
|
0.108453
|
-0.418992
|
0.242912
|
df.drop('A')
        one       two     three      four      five       six
B -1.455186  1.846691  0.234276  0.660298 -2.169835 -1.476485
C  0.322281  0.505378  0.198458 -0.831919 -0.630789  0.762524
D  0.703684 -0.827597  0.178063  0.108453 -0.418992  0.242912
df2 = df.drop(['two', 'four'], axis=1)
df2
        one     three      five       six
A -0.665415  0.075058 -1.748458 -0.608540
B -1.455186  0.234276 -2.169835 -1.476485
C  0.322281  0.198458 -0.630789  0.762524
D  0.703684  0.178063 -0.418992  0.242912
drop also returns a copy of the data.
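A quick sketch verifying that drop leaves the original untouched; pass inplace=True if you do want to modify the frame in place.

```python
import pandas as pd

# drop returns a new DataFrame; the original keeps all its columns.
df = pd.DataFrame({'one': [1, 2], 'two': [3, 4]})
dropped = df.drop('two', axis=1)
print(list(df.columns))       # ['one', 'two']
print(list(dropped.columns))  # ['one']
```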
Broadcasting
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
        A   B   C
one     0   1   2
two     3   4   5
three   6   7   8
four    9  10  11
df.loc['one']
A 0
B 1
C 2
Name: one, dtype: int64
df - df.loc['one']
       A  B  C
one    0  0  0
two    3  3  3
three  6  6  6
four   9  9  9
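Plain `-` broadcasts a row across the frame by matching column labels. To broadcast a column down the rows instead, use the arithmetic methods, which take an axis argument. A minimal sketch:

```python
import pandas as pd
import numpy as np

# sub() with axis=0 matches on the row index, so the column A
# is subtracted from every column of the frame.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))
out = df.sub(df['A'], axis=0)
print(out['B'].tolist())  # [1, 1, 1, 1]
```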
Applying functions
- apply: computes over whole rows or columns
- applymap: computes element by element
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
        A   B   C
one     0   1   2
two     3   4   5
three   6   7   8
four    9  10  11
# each column is passed to the lambda as a Series
df.apply(lambda x: x.max() - x.min())
A 9
B 9
C 9
dtype: int64
# each row is passed to the lambda as a Series
df.apply(lambda x: x.max() - x.min(), axis=1)
one 2
two 2
three 2
four 2
dtype: int64
# return a Series holding several values
def min_max(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
df.apply(min_max, axis=1)
       min  max
one      0    2
two      3    5
three    6    8
four     9   11
# applymap: element-wise computation
df = pd.DataFrame(np.random.randn(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
              A         B         C
one   -1.126089 -0.286584  1.538841
two    1.804348 -0.709293 -0.400643
three -1.008037 -0.791648  0.388505
four  -0.071827  0.659098 -0.505030
formater = '{0:.02f}'.format
# formater = lambda x: '%.02f' % x
df.applymap(formater)
The 0 in the format string refers to the first positional argument; a bound format method written this way is itself a function.
           A      B      C
one    -1.13  -0.29   1.54
two     1.80  -0.71  -0.40
three  -1.01  -0.79   0.39
four   -0.07   0.66  -0.51
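Note that applymap was deprecated in pandas 2.1 in favor of DataFrame.map, which does the same element-wise application. A sketch that works on both old and new versions, guarded by a feature check:

```python
import pandas as pd

# DataFrame.map exists only in pandas >= 2.1; fall back to
# applymap on older versions.
df = pd.DataFrame({'A': [1.234, 5.678]})
fmt = '{0:.02f}'.format
out = df.map(fmt) if hasattr(df, 'map') else df.applymap(fmt)
print(out['A'].tolist())  # ['1.23', '5.68']
```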
Sorting and ranking
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), index=list('ABCD'), columns=['one', 'two', 'three'])
df
   one  two  three
A    7    8      5
B    2    6      4
C    4    5      4
D    6    2      1
df.sort_values(by='one')
   one  two  three
B    2    6      4
C    4    5      4
D    6    2      1
A    7    8      5
s = pd.Series([3, 6, 2, 6, 4])
s.rank()
0 2.0
1 4.5
2 1.0
3 4.5
4 3.0
dtype: float64
s.rank(method='first', ascending=False)
0    4.0
1    1.0
2    5.0
3    2.0
4    3.0
dtype: float64
method='first' breaks ties in order of appearance. The default is 'average'; other options are 'min', 'max', and 'dense'.
A DataFrame ranks each column independently by default, unlike sort_values, which reorders whole rows.
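To rank within each row instead, pass axis=1. A minimal sketch:

```python
import pandas as pd

# rank(axis=1) compares values across each row rather than down each column.
df = pd.DataFrame({'one': [7, 2], 'two': [8, 6], 'three': [5, 4]})
print(df.rank(axis=1).loc[0].tolist())  # [2.0, 3.0, 1.0]
```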
Uniqueness and membership
These methods apply to Series:
s = pd.Series(list('abbcdabacad'))
s
0 a
1 b
2 b
3 c
4 d
5 a
6 b
7 a
8 c
9 a
10 d
dtype: object
s.unique()
array(['a', 'b', 'c', 'd'], dtype=object)
s.value_counts()
a 4
b 3
d 2
c 2
dtype: int64
s.isin(['a', 'b', 'c'])
0 True
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 True
9 True
10 False
dtype: bool
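Since isin() returns a boolean mask, it combines naturally with selection for membership-based filtering:

```python
import pandas as pd

# Keep only the elements belonging to the given set.
s = pd.Series(list('abbcd'))
print(s[s.isin(['a', 'c'])].tolist())  # ['a', 'c']
```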