python数据科学包第二天

文章目录

pandas入门
- 矩阵运算
- 重建索引
- 统计
- 数据合并
- 分组统计
- 数据整形
- 数据透视表
- 时间序列
- 类别数据
- 画图
- 数据读写
电影数据分析
- 准备工作
- 数据说明
- 利用 Pandas 分析电影评分数据
- 数据合并
pandas核心数据结构
- Series
- - 从 ndaray 创建
  - 从字典创建
  - 从标量创建
- Series特性
- - Series 是类 ndarray 对象
  - Series 是类字典对象
  - 标签对齐操作
  - name 属性
- DataFrame
- - 从结构化数据中创建
  - 从字典列表创建
  - 从元组字典创建
  - 从 Series 创建
  - 列选择/增加/删除
  - 使用 assign() 方法来插入新列
  - 索引和选择
  - 数据对齐
  - 使用 numpy 函数
- Panel
基础运算
重新索引
- Series
- DataFrame
丢弃部分数据
- 广播运算
- 函数应用
- 排序和排名
- 数据唯一性及成员资格

pandas入门

首先，下面所有的例子均需要导入包：

# 设置为 inline 风格
%matplotlib inline
# 包导入
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

一维数组：Series：

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

输出：

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

二维数组创建：

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

输出

	A	B	C	D
2016-03-01	-0.859805	-0.069692	-0.905092	-0.553213
2016-03-02	-0.353785	0.031793	-0.785213	-0.212337
2016-03-03	1.719976	0.925145	0.241639	-0.490166
2016-03-04	-1.207854	-0.001647	-0.468976	-0.781144
2016-03-05	0.452034	1.371208	1.152729	1.470498
2016-03-06	1.378227	0.246941	-1.186630	-0.411647

df.valuesarray([[-8.59804740e-01, -6.96922374e-02, -9.05091676e-01,-5.53212518e-01],[-3.53785450e-01,  3.17933613e-02, -7.85212585e-01,-2.12337229e-01],[ 1.71997643e+00,  9.25144720e-01,  2.41639347e-01,-4.90166361e-01],[-1.20785422e+00, -1.64720630e-03, -4.68976209e-01,-7.81144372e-01],[ 4.52033577e-01,  1.37120779e+00,  1.15272905e+00,1.47049771e+00],[ 1.37822701e+00,  2.46941166e-01, -1.18662963e+00,-4.11647030e-01]])

用字典来构建DataFrame，如下：

# 使用字典来创建：key 为 DataFrame 的列；value 为对应列下的值
df1 = pd.DataFrame({'A': 1,'B': pd.Timestamp('20160301'),'C': range(4),'D': np.arange(5, 9),'E': 'text','F': ['AA', 'BB', 'CC', 'DD']})
df1

	A	B	C	D	E	F
0	1	2016-03-01	0	5	text	AA
1	1	2016-03-01	1	6	text	BB
2	1	2016-03-01	2	7	text	CC
3	1	2016-03-01	3	8	text	DD

各个列可以通过DataFrame的属性来访问，如

df.A、df.B

type(df.A)

pandas.core.series.Series

可以看出，每一列或者每一行，都是Series数据

df.types  #查看每一列的数据类型
df.shape  #查看数据区的矩阵行列数
df.head() #查看矩阵的前五行，在函数内指定数字，可以查看指定行数
df.tail() #查看矩阵的后五行
df.index  #矩阵的行索引
df.columns#矩阵的列索引
df.values #查看矩阵的值，返回的是一个ndarray类型矩阵
df.describe()#查看矩阵的统计信息，默认是axis=0，及每一列的各行数据参与运算，生成每一列的统计信息，包括（每一列有多少值，平均值，标准差，最小值，四分位，中位数，四分之三分位，最大值），其中四分位的计算方法是减1计算法，具体可以查看百度百科
df.T      #矩阵的转置，或者使用transpose，类似numpy的方法
df.sort_index()#将行索引排序，参数axis可以指定为1，对列索引排序，ascending表示降序还是升序
df.sort_values(by='A')#依据每一列的值进行排序
df.['A']  #选择某列
df.A
df[2:4]   #选择行，数字也可以换成行索引，布尔值索引，但当行索引也是数字时，默认使用数字，使用索引的切片，是包含最后一个元素所在的行的
df.loc[]  #使用索引来查找数据
df.iloc[] #使用数字来查找数据
#例子
df.loc[:,['B','C']],也可以访问某个特定的值
#
df.at[,]     #根据行列标签（索引）选择数据
df.iat[,]    #根据数字选择数据df[df.A>0]   #根据布尔索引，找出A列大于0的行
df[df>0]     #非零的数据为NaN

矩阵运算

df['tag'] = ['a'] * 2 + ['b'] * 2 + ['c'] * 2

	A	B	C	D	tag
2016-03-01	0.033228	2.307123	-0.585367	-1.671832	a
2016-03-02	-1.967299	0.727670	-0.190863	-0.163514	a
2016-03-03	0.065359	0.696804	-0.550040	0.717347	b
2016-03-04	0.234850	0.289520	-1.087173	1.534277	b
2016-03-05	-1.459620	1.040987	0.220130	-0.068131	c
2016-03-06	0.865402	2.650889	-0.015460	-0.111889	c

df[df.tag.isin(['a', 'c'])]

	A	B	C	D	tag
2016-03-01	0.033228	2.307123	-0.585367	-1.671832	a
2016-03-02	-1.967299	0.727670	-0.190863	-0.163514	a
2016-03-05	-1.459620	1.040987	0.220130	-0.068131	c
2016-03-06	0.865402	2.650889	-0.015460	-0.111889	c

### 修改元素

s = pd.Series(np.arange(6), index=pd.date_range('20160301', periods=6))
df['E'] = s   # 修改一列
df.at[pd.Timestamp('20160301'), 'A'] = 0.4  #  修改某个元素
df.B = 200  #将一列的值都修改成一个值
df.iloc[:,2:5] = 1000 #修改子表

要结束网页的jupyter notebook，在控制台两次ctrl+c

重建索引

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(data=np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2016-03-01	-0.985666	0.240058	0.716721	0.352009
2016-03-02	-1.563644	0.091766	1.081764	0.951541
2016-03-03	0.279760	-0.316136	1.198073	-0.562947
2016-03-04	1.174777	-0.225305	-0.280256	-0.074768
2016-03-05	2.173366	0.907038	-1.104678	-0.921779
2016-03-06	0.200422	0.442619	1.970330	-0.609867

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

	A	B	C	D	E
2016-03-01	-0.985666	0.240058	0.716721	0.352009	NaN
2016-03-02	-1.563644	0.091766	1.081764	0.951541	NaN
2016-03-03	0.279760	-0.316136	1.198073	-0.562947	NaN
2016-03-04	1.174777	-0.225305	-0.280256	-0.074768	NaN

### 处理丢失数据

df1.loc[dates[1:3], 'E'] = 1
df1

	A	B	C	D	E
2016-03-01	-0.985666	0.240058	0.716721	0.352009	NaN
2016-03-02	-1.563644	0.091766	1.081764	0.951541	1
2016-03-03	0.279760	-0.316136	1.198073	-0.562947	1
2016-03-04	1.174777	-0.225305	-0.280256	-0.074768	NaN

此时的E列有两个空值

df1.dropna()  #去除了空值所在的行
df1.fillna(value=5) #将空值替换成5
pd.isnull(df1) #判断是否有空数据
pd.isnull(df1).any()#列中是否有空数据
pd.isnull(df1).any().any()#表中是否有空数据

统计

空数据是不参与统计计算的

df1.mean()  #各列的平均值
df.sum(axis='columns') #计算加和
df.cumsum() #计算累加值
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)  #shift表示数据段整体往后移动两个位置，前面空出来的位置为NaN
df.sub(s, axis='index')#每个列都减处理，控制行变成空值，一般都是对空值做预处理在减
df.apply(np.cumsum) #效果同df.cumsum()，注意这里的sumsum属于np模块，不是pd模块，apply是吧一个列作为参数传给函数，applymap是把每个元素做处理
df.apply(lambda x: x.max() - x.min()) #得出每一列最大值减最小值

s = pd.Series(np.random.randint(0, 7, size=10)) ￥不包括最后一个数，但python自带函数是包括的
s.value_counts() #统计各种数字的数量
s.mode()         #最多的数是哪些

数据合并

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

	A	B	C	D
0	1.098103	-0.843356	-0.379135	0.419353
1	-0.177702	-0.225926	-0.363542	-0.153022
2	1.938231	0.154881	0.291382	0.152774
3	-0.460645	-0.268697	-1.509469	0.698776
4	-0.397048	-0.958223	0.212833	-0.435485
5	0.525406	-0.177595	0.453216	-0.093792
6	0.531912	-0.832667	0.200721	0.943878
7	-0.740845	0.098634	0.274020	1.671997
8	2.182379	1.729010	1.306269	0.580677
9	-0.031538	0.159714	0.736667	-0.122326

df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]])
(df == df1).all().all()  #判断两个矩阵是否相等# SQL 样式的联合查询
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
# SELECT * FROM left INNER JOIN right ON left.key = right.key;
pd.merge(left, right, on='key')s = pd.Series(np.random.randint(1, 5, size=4), index=list('ABCD'))
df.append(s, ignore_index=True)

如果s有五列，在插入后，多出来新的一列，除了最后一行以外，其他都是NaN

分组统计

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],'C' : np.random.randn(8),'D' : np.random.randn(8)})
df

	A	B	C	D
0	foo	one	-0.580320	-1.460149
1	bar	one	1.471201	-1.079598
2	foo	two	0.094836	1.513204
3	bar	three	-1.498810	0.754968
4	foo	two	0.180709	0.415266
5	bar	two	0.358515	-0.341988
6	foo	one	-0.121082	-0.408148
7	foo	three	0.404648	-0.320882

df.groupby('A').sum()

	C	D
A
bar	0.330906	-0.666618
foo	-0.021208	-0.260709

df.groupby(['A', 'B']).sum()

		C	D
A	B
bar	one	1.471201	-1.079598
	three	-1.498810	0.754968
	two	0.358515	-0.341988
foo	one	-0.701402	-1.868297
	three	0.404648	-0.320882
	two	0.275545	1.928470

形成了双索引的结构

数据整形

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

		A	B
first	second
bar	one	0.072026	0.422077
bar	two	-1.099181	-0.354796
baz	one	1.285500	-1.185525
baz	two	0.645316	-0.660115
foo	one	0.696443	-1.664527
foo	two	0.718399	-0.154125
qux	one	-0.740052	0.713089
qux	two	-0.672748	-1.346843

stacked = df.stack()

first  second
bar    one     A    0.072026B    0.422077two     A   -1.099181B   -0.354796
baz    one     A    1.285500B   -1.185525two     A    0.645316B   -0.660115
foo    one     A    0.696443B   -1.664527two     A    0.718399B   -0.154125
qux    one     A   -0.740052B    0.713089two     A   -0.672748B   -1.346843
dtype: float64

stacked.unstack()  #装换成以前的样子了
stacked.unstack().unstack()

	A		B
second	one	two	one	two
first
bar	0.072026	-1.099181	0.422077	-0.354796
baz	1.285500	0.645316	-1.185525	-0.660115
foo	0.696443	0.718399	-1.664527	-0.154125
qux	-0.740052	-0.672748	0.713089	-1.346843

#可以指明选择哪一列作为列索引
stacked.unstack(1)

<div>
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th></th><th>second</th><th>one</th><th>two</th></tr><tr><th>first</th><th></th><th></th><th></th></tr></thead><tbody><tr><th rowspan="2" valign="top">bar</th><th>A</th><td>0.072026</td><td>-1.099181</td></tr><tr><th>B</th><td>0.422077</td><td>-0.354796</td></tr><tr><th rowspan="2" valign="top">baz</th><th>A</th><td>1.285500</td><td>0.645316</td></tr><tr><th>B</th><td>-1.185525</td><td>-0.660115</td></tr><tr><th rowspan="2" valign="top">foo</th><th>A</th><td>0.696443</td><td>0.718399</td></tr><tr><th>B</th><td>-1.664527</td><td>-0.154125</td></tr><tr><th rowspan="2" valign="top">qux</th><th>A</th><td>-0.740052</td><td>-0.672748</td></tr><tr><th>B</th><td>0.713089</td><td>-1.346843</td></tr></tbody>
</table>
</div>

数据透视表

数据透视是指查看里面的一部分数据

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,'B' : ['A', 'B', 'C'] * 4,'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,'D' : np.random.randn(12),'E' : np.random.randn(12)})
df

	A	B	C	D	E
0	one	A	foo	1.477533	1.557713
1	one	B	foo	0.019528	2.483014
2	two	C	foo	-0.912452	0.409732
3	three	A	bar	0.502807	-0.462401
4	one	B	bar	1.709597	-1.739413
5	one	C	bar	-0.658155	1.302735
6	two	A	foo	0.007806	0.782926
7	three	B	foo	-0.067922	-0.193820
8	one	C	foo	0.806713	0.383870
9	one	A	bar	0.794017	0.749756
10	two	B	bar	-0.532554	-0.811900
11	three	C	bar	0.464731	1.168423

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

	C	bar	foo
A	B
one	A	0.794017	1.477533
	B	1.709597	0.019528
	C	-0.658155	0.806713
three	A	0.502807	NaN
	B	NaN	-0.067922
	C	0.464731	NaN
two	A	NaN	0.007806
	B	-0.532554	NaN
	C	NaN	-0.912452

**当这个数据透视表对应多个值时，回去取平均值**

时间序列

rng = pd.date_range('20160301', periods=600, freq='s')
rng

DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01','2016-03-01 00:00:02', '2016-03-01 00:00:03','2016-03-01 00:00:04', '2016-03-01 00:00:05','2016-03-01 00:00:06', '2016-03-01 00:00:07','2016-03-01 00:00:08', '2016-03-01 00:00:09',...'2016-03-01 00:09:50', '2016-03-01 00:09:51','2016-03-01 00:09:52', '2016-03-01 00:09:53','2016-03-01 00:09:54', '2016-03-01 00:09:55','2016-03-01 00:09:56', '2016-03-01 00:09:57','2016-03-01 00:09:58', '2016-03-01 00:09:59'],dtype='datetime64[ns]', length=600, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

2016-03-01 00:00:00     34
2016-03-01 00:00:01      4
2016-03-01 00:00:02    382
2016-03-01 00:00:03    164
2016-03-01 00:00:04    178
2016-03-01 00:00:05    421
2016-03-01 00:00:06     34
2016-03-01 00:00:07     71
2016-03-01 00:00:08    316
2016-03-01 00:00:09    201
2016-03-01 00:00:10    214
2016-03-01 00:00:11    443
2016-03-01 00:00:12    185
2016-03-01 00:00:13     79
2016-03-01 00:00:14     38
2016-03-01 00:00:15    465
2016-03-01 00:00:16    309
2016-03-01 00:00:17     93
2016-03-01 00:00:18     20
2016-03-01 00:00:19    338
2016-03-01 00:00:20    149
2016-03-01 00:00:21     34
2016-03-01 00:00:22    257
2016-03-01 00:00:23    462
2016-03-01 00:00:24     41
2016-03-01 00:00:25    471
2016-03-01 00:00:26    313
2016-03-01 00:00:27    224
2016-03-01 00:00:28     78
2016-03-01 00:00:29    498...
2016-03-01 00:09:30     61
2016-03-01 00:09:31    315
2016-03-01 00:09:32    388
2016-03-01 00:09:33    391
2016-03-01 00:09:34    263
2016-03-01 00:09:35     11
2016-03-01 00:09:36     61
2016-03-01 00:09:37    400
2016-03-01 00:09:38    109
2016-03-01 00:09:39    135
2016-03-01 00:09:40    267
2016-03-01 00:09:41    248
2016-03-01 00:09:42    469
2016-03-01 00:09:43    155
2016-03-01 00:09:44    284
2016-03-01 00:09:45    168
2016-03-01 00:09:46    228
2016-03-01 00:09:47    244
2016-03-01 00:09:48    442
2016-03-01 00:09:49    450
2016-03-01 00:09:50    226
2016-03-01 00:09:51    370
2016-03-01 00:09:52    192
2016-03-01 00:09:53    325
2016-03-01 00:09:54     82
2016-03-01 00:09:55    154
2016-03-01 00:09:56    285
2016-03-01 00:09:57     22
2016-03-01 00:09:58     48
2016-03-01 00:09:59    171
Freq: S, dtype: int32

#重新采样
ts.resample('2Min', how='sum')

2016-03-01 00:00:00    28595
2016-03-01 00:02:00    29339
2016-03-01 00:04:00    28991
2016-03-01 00:06:00    30789
2016-03-01 00:08:00    30131
Freq: 2T, dtype: int32

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng

PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2','1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4','1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2','1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4','1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2','1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4','1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2','2000Q3', '2000Q4'],dtype='int64', freq='Q-NOV')

prng.to_timestamp()   #转成标准格式数据 年-月-日
pd.Timestamp('20160301')-pd.Timestamp('2016') #计算时间差

类别数据

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df

	id	raw_grade
0	1	a
1	2	b
2	3	b
3	4	a
4	5	a
5	6	e

df["grade"] = df["raw_grade"].astype("category")
df

	id	raw_grade	grade
0	1	a	a
1	2	b	b
2	3	b	b
3	4	a	a
4	5	a	a
5	6	e	e

df["grade"].cat.categories

Index([u'a', u'b', u'e'], dtype='object')

df["grade"].cat.categories = ["very good", "good", "very bad"]
df

	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

df.sort_values(by='grade', ascending=True)

	id	raw_grade	grade
0	1	a	very good
3	4	a	very good
4	5	a	very good
1	2	b	good
2	3	b	good
5	6	e	very bad

画图

ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts

2000-01-01     0.416424
2000-01-02     0.603304
2000-01-03    -0.237965
2000-01-04     0.317450
2000-01-05     0.665045
2000-01-06     2.468087
2000-01-07     2.758852
2000-01-08     2.271343
2000-01-09     3.129609
2000-01-10     5.171241
2000-01-11     5.049896
2000-01-12     5.185316
2000-01-13     4.169058
2000-01-14     2.862306
2000-01-15     4.018617
2000-01-16     4.456694
2000-01-17     5.824236
2000-01-18     6.094983
2000-01-19     5.880954
2000-01-20     5.875111
2000-01-21     6.008481
2000-01-22     6.835501
2000-01-23     7.480405
2000-01-24     6.849335
2000-01-25     7.608887
2000-01-26     9.029474
2000-01-27     8.859222
2000-01-28     7.162806
2000-01-29     7.398013
2000-01-30     7.391844...
2002-08-28    21.728409
2002-08-29    21.757852
2002-08-30    21.047643
2002-08-31    20.114996
2002-09-01    18.769902
2002-09-02    17.417680
2002-09-03    17.917688
2002-09-04    18.064786
2002-09-05    19.312356
2002-09-06    18.633479
2002-09-07    17.711879
2002-09-08    19.162369
2002-09-09    19.697896
2002-09-10    18.895018
2002-09-11    18.590989
2002-09-12    17.278925
2002-09-13    17.730168
2002-09-14    19.058526
2002-09-15    18.898382
2002-09-16    17.048621
2002-09-17    16.443233
2002-09-18    16.842284
2002-09-19    14.627031
2002-09-20    15.500982
2002-09-21    14.640444
2002-09-22    13.183795
2002-09-23    13.383657
2002-09-24    13.006229
2002-09-25    12.311008
2002-09-26    11.674804
Freq: D, dtype: float64

ts.plot()

数据读写

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df

	A	B	C	D
0	-1.052421	-0.164992	3.098604	-0.966960
1	1.194177	0.086880	0.496095	0.265308
2	0.297724	1.284297	-0.130855	-0.229570
3	-0.787063	0.553680	0.546853	-0.322599
4	0.033174	-1.222281	0.320090	-1.749333
5	0.109575	0.310684	1.620296	-0.928869
6	0.761408	-0.027630	0.458341	-0.785370
7	-1.150479	-0.718584	1.028866	0.419026
8	-2.906881	-0.295700	-0.342306	-0.765172
9	0.916363	-1.181429	-1.559657	-1.171191
10	0.578659	0.804726	1.299496	0.176843
11	0.150659	-0.162833	-1.086055	1.240432
12	-0.819219	1.668234	0.217604	-0.779170
13	-0.550658	-0.672640	-0.674157	-0.637602
14	0.901584	0.046023	0.244370	0.374293
15	0.971181	-0.442618	0.179083	0.086095
16	-0.570786	-1.019239	1.684833	0.539140
17	-1.432314	1.369588	2.091300	0.733526
18	-1.115526	-0.115884	2.636074	-0.788859
19	1.601554	1.226182	0.169308	-0.616585
20	0.571316	0.542432	0.306595	0.780939
21	-0.540414	1.036656	0.683224	-0.116963
22	1.319110	-1.265207	1.371924	0.881560
23	1.584346	-1.719633	-1.365020	-0.617224
24	-0.440420	-0.799265	0.376128	-0.654581
25	-0.261730	-0.046325	-0.289009	0.505634
26	0.385047	0.112723	0.428345	-0.008455
27	-0.921668	1.609848	1.592532	-0.623103
28	0.280799	-0.231821	-1.589829	-1.791286
29	0.661562	0.621305	0.921586	-0.312834
...	...	...	...	...
70	0.064385	0.669585	-1.347073	0.941348
71	-1.534420	-1.227736	0.459771	-1.150254
72	0.010741	0.062820	-1.098301	1.268482
73	-1.183586	1.159889	-0.186617	-0.847210
74	-0.705815	-0.371896	0.313020	0.035314
75	-2.945315	-0.421227	-0.403479	1.387825
76	-0.122383	0.474282	-2.039155	-0.155960
77	0.921353	-0.430436	-0.599253	0.911030
78	0.018444	0.098611	0.320480	0.001282
79	-0.188301	-2.015690	-0.427172	-0.146939
80	-0.006022	0.213421	1.358382	-0.414890
81	0.596546	0.042708	1.325342	-0.800222
82	-1.736245	-0.056213	-0.415892	-0.360570
83	0.463591	-0.404202	0.577191	0.336023
84	-1.397557	0.442012	0.007915	-1.305628
85	-0.137766	-0.771713	0.200956	-0.365344
86	0.988833	-0.165965	-0.893573	-0.318324
87	1.093799	1.694406	-0.868420	0.100202
88	-0.240628	0.539268	-1.094841	1.737569
89	1.850923	-0.472270	-2.317345	-0.544395
90	0.617284	1.224130	-1.722366	0.236574
91	1.282967	0.738570	1.748848	-0.106646
92	0.775707	-0.494293	-1.098466	0.372206
93	-0.846466	0.735144	1.456520	1.622817
94	-0.860999	1.146650	-1.064013	1.400919
95	-0.095498	-1.849518	2.303532	0.688425
96	-0.017921	-0.558700	-1.061605	0.781250
97	-1.069070	1.106837	-1.936800	-0.782616
98	0.436267	0.463537	0.614982	-0.123774
99	-1.440635	-1.506836	-0.386824	1.118260

100 rows × 4 columns

df.to_csv('data.csv')
%ls
%more data.csv

# pd.read_csv('data.csv')
pd.read_csv('data.csv', index_col=0)#以第一列作为索引

电影数据分析

准备工作

从网站 grouplens.org/datasets/movielens 下载 MovieLens 1M Dataset 数据。

数据说明

参阅数据介绍文件 README.txt

利用 Pandas 分析电影评分数据

数据读取
数据合并
统计电影平均得分
统计活跃电影 -> 获得评分的次数越多说明电影越活跃
女生最喜欢的电影排行榜
男生最喜欢的电影排行榜
男女生评分差距最大的电影 -> 某类电影女生喜欢，但男生不喜欢
最具争议的电影排行榜 -> 评分的方差最大

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=user_names, engine='python')rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')

pandas内部有两种实现方式，一种是C语言，一种是python，这里指定python，因为C语言功能不如python多。

print len(users)
users.head(5)

	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

数据合并

data = pd.merge(pd.merge(users, ratings), movies)

len(data)
data.head(5)

	user_id	gender	age	occupation	zip	movie_id	rating	timestamp	title	genres
0	1	F	1	10	48067	1193	5	978300760	One Flew Over the Cuckoo's Nest (1975)	Drama
1	2	M	56	16	70072	1193	5	978298413	One Flew Over the Cuckoo's Nest (1975)	Drama
2	12	M	25	12	32793	1193	4	978220179	One Flew Over the Cuckoo's Nest (1975)	Drama
3	15	M	25	7	22903	1193	4	978199279	One Flew Over the Cuckoo's Nest (1975)	Drama
4	17	M	50	1	95350	1193	5	978158471	One Flew Over the Cuckoo's Nest (1975)	Drama

data[data.user_id == 1]  #查看你用户1对所有电影的评分情况

# 按性别查看各个电影的平均评分
mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings_gender.head(5)

gender	F	M
title
$1,000,000 Duck (1971)	3.375000	2.761905
'Night Mother (1986)	3.388889	3.352941
'Til There Was You (1997)	2.675676	2.733333
'burbs, The (1989)	2.793478	2.962085
...And Justice for All (1979)	3.828571	3.689024

# 男女意见想差最大的电影 -> 价值观/品味冲突
mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M
mean_ratings_gender.head(5)

gender	F	M	diff
title
$1,000,000 Duck (1971)	3.375000	2.761905	0.613095
'Night Mother (1986)	3.388889	3.352941	0.035948
'Til There Was You (1997)	2.675676	2.733333	-0.057658
'burbs, The (1989)	2.793478	2.962085	-0.168607
...And Justice for All (1979)	3.828571	3.689024	0.139547

mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)

gender	F	M	diff
title
Tigrero: A Film That Was Never Made (1994)	1	4.333333	-3.333333
Neon Bible, The (1995)	1	4.000000	-3.000000
Enfer, L' (1994)	1	3.750000	-2.750000
Stalingrad (1993)	1	3.593750	-2.593750
Killer: A Journal of Murder (1995)	1	3.428571	-2.428571
Dangerous Ground (1997)	1	3.333333	-2.333333
In God's Hands (1998)	1	3.333333	-2.333333
Rosie (1998)	1	3.333333	-2.333333
Flying Saucer, The (1950)	1	3.300000	-2.300000
Jamaica Inn (1939)	1	3.142857	-2.142857

# 活跃电影排行榜
ratings_by_movie_title = data.groupby('title').size()
ratings_by_movie_title.head(5)

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

# 前二十大高分电影 -> 平均评分最高的电影
mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
top_20_mean_ratings = mean_ratings.sort_values(ascending=False).head(20)
top_20_mean_ratings

title
Gate of Heavenly Peace, The (1995)                                     5.000000
Lured (1947)                                                           5.000000
Ulysses (Ulisse) (1954)                                                5.000000
Smashing Time (1967)                                                   5.000000
Follow the Bitch (1998)                                                5.000000
Song of Freedom (1936)                                                 5.000000
Bittersweet Motel (2000)                                               5.000000
Baby, The (1973)                                                       5.000000
One Little Indian (1973)                                               5.000000
Schlafes Bruder (Brother of Sleep) (1995)                              5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                    4.800000
Lamerica (1994)                                                        4.750000
Apple, The (Sib) (1998)                                                4.666667
Sanjuro (1962)                                                         4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)    4.560510
Shawshank Redemption, The (1994)                                       4.554558
Godfather, The (1972)                                                  4.524966
Close Shave, A (1995)                                                  4.520548
Usual Suspects, The (1995)                                             4.517106
Schindler's List (1993)                                                4.510417
Name: rating, dtype: float64

# 前十大热闹电影的平均评分 -> 不一定越热闹的电影，评分越高
mean_ratings[top_10_ratings.index]

title
American Beauty (1999)                                   4.317386
Star Wars: Episode IV - A New Hope (1977)                4.453694
Star Wars: Episode V - The Empire Strikes Back (1980)    4.292977
Star Wars: Episode VI - Return of the Jedi (1983)        4.022893
Jurassic Park (1993)                                     3.763847
Saving Private Ryan (1998)                               4.337354
Terminator 2: Judgment Day (1991)                        4.058513
Matrix, The (1999)                                       4.315830
Back to the Future (1985)                                3.990321
Silence of the Lambs, The (1991)                         4.351823
Name: rating, dtype: float64

# 前二十大高分电影的热闹程度 -> 不一定评分越高的电影越热闹，可能某个很小众的电影看得人少，但评分很高
ratings_by_movie_title[top_20_mean_ratings.index]

title
Gate of Heavenly Peace, The (1995)                                        3
Lured (1947)                                                              1
Ulysses (Ulisse) (1954)                                                   1
Smashing Time (1967)                                                      2
Follow the Bitch (1998)                                                   1
Song of Freedom (1936)                                                    1
Bittersweet Motel (2000)                                                  1
Baby, The (1973)                                                          1
One Little Indian (1973)                                                  1
Schlafes Bruder (Brother of Sleep) (1995)                                 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                       5
Lamerica (1994)                                                           8
Apple, The (Sib) (1998)                                                   9
Sanjuro (1962)                                                           69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)     628
Shawshank Redemption, The (1994)                                       2227
Godfather, The (1972)                                                  2223
Close Shave, A (1995)                                                   657
Usual Suspects, The (1995)                                             1783
Schindler's List (1993)                                                2304
dtype: int64

# 十大好电影 -> 活跃度超过 1000 的高分电影
top_10_movies = mean_ratings[top_ratings.index].sort_values(ascending=False).head(10)
top_10_movies

title
Shawshank Redemption, The (1994)                                               4.554558
Godfather, The (1972)                                                          4.524966
Usual Suspects, The (1995)                                                     4.517106
Schindler's List (1993)                                                        4.510417
Raiders of the Lost Ark (1981)                                                 4.477725
Rear Window (1954)                                                             4.476190
Star Wars: Episode IV - A New Hope (1977)                                      4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    4.449890
Casablanca (1942)                                                              4.412822
Sixth Sense, The (1999)                                                        4.406263
Name: rating, dtype: float64

pandas核心数据结构

import pandas as pd
import numpy as np

Series

Series 是一维带标签的数组，数组里可以放任意的数据（整数，浮点数，字符串，Python Object）。其基本的创建函数是：

s = pd.Series(data, index=index)

其中 index 是一个列表，用来作为数据的标签。data 可以是不同的数据类型：

Python 字典
ndarray 对象
一个标量值，如 5

从 ndaray 创建

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.747292
b   -1.120276
c   -0.132692
d   -0.267813
e   -0.590904
dtype: float64

s.index

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

s = pd.Series(np.random.randn(5))
s

0    0.324214
1   -0.183776
2   -0.518808
3    0.866421
4   -0.601668
dtype: float64

s.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')

从字典创建

# 空值的默认处理
d = {'a' : 0., 'b' : 1., 'd' : 3}
s = pd.Series(d, index=list('abcd'))
s

a     0
b     1
c   NaN
d     3
dtype: float64

从标量创建

pd.Series(3, index=list('abcde'))

a    3
b    3
c    3
d    3
e    3
dtype: int64

Series特性

Series 是类 ndarray 对象

熟悉 numpy 的同学对下面的操作应该不会陌生。我们在 numpy 简介里也介绍过下面的索引方式。

s = pd.Series(np.random.randn(5))
s

0    0.882069
1   -0.134360
2   -0.925088
3    0.191072
4    2.546704
dtype: float64

s[0]

0.88206876023157332

s[:3]

0    0.882069
1   -0.134360
2   -0.925088
dtype: float64

s[[1, 3, 4]]

1   -0.134360
3    0.191072
4    2.546704
dtype: float64

np.exp(s)

0     2.415892
1     0.874275
2     0.396497
3     1.210546
4    12.764963
dtype: float64

np.sin(s)

0    0.772055
1   -0.133957
2   -0.798673
3    0.189911
4    0.560416
dtype: float64

Series 是类字典对象

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e   -0.462794
dtype: float64

s['a']

-2.1498403551053218

s['e'] = 5

a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e    5.000000
dtype: float64

s['g'] = 100

a     -2.149840
b     -0.924115
c      0.481231
d      1.033813
e      5.000000
g    100.000000
dtype: float64

'e' in s

True

'f' in s

False

# s['f']

print s.get('f')

None

print s.get('f', np.nan)

nan

标签对齐操作

s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print '{0}\n\n{1}'.format(s1, s2)

a   -0.917905
c   -0.744616
e    0.114522
dtype: float64a    0.721087
d   -0.471575
e    0.796093
dtype: float64

s1 + s2

name 属性

s = pd.Series(np.random.randn(5), name='Some Thing')
s

0    0.623787
1    0.517239
2    1.551314
3    1.414463
4   -1.224611
Name: Some Thing, dtype: float64

s.name

'Some Thing'a   -0.196818
c         NaN
d         NaN
e    0.910615
dtype: float64

DataFrame

DataFrame 是二维带行标签和列标签的数组。可以把 DataFrame 想你成一个 Excel 表格或一个 SQL 数据库的表格，还可以相像成是一个 Series 对象字典。它是 Pandas 里最常用的数据结构。

创建 DataFrame 的基本格式是：

df = pd.DataFrame(data, index=index, columns=columns)

其中 index 是行标签，columns 是列标签，data 可以是下面的数据：

由一维 numpy 数组，list，Series 构成的字典
二维 numpy 数组
一个 Series
另外的 DataFrame 对象

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

pd.DataFrame(d)

	one	two
a	1	1
b	2	2
c	3	3
d	NaN	4

pd.DataFrame(d, index=['d', 'b', 'a'])

	one	two
d	NaN	4
b	2	2
a	1	1

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

	two	three
d	4	NaN
b	2	NaN
a	1	NaN

如果是列表，必须元素个数一样，否则会报错

d = {'one' : [1, 2, 3, 4],'two' : [21, 22, 23, 24]}

pd.DataFrame(d)

	one	two
0	1	21
1	2	22
2	3	23
3	4	24

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

	one	two
a	1	21
b	2	22
c	3	23
d	4	24

从结构化数据中创建

data = [(1, 2.2, 'Hello'), (2, 3., "World")]

pd.DataFrame(data)

	0	1	2
0	1	2.2	Hello
1	2	3.0	World

pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])

	A	B	C
first	1	2.2	Hello
second	2	3.0	World

从字典列表创建

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data)

	a	b	c
0	1	2	NaN
1	5	10	20

```python pd.DataFrame(data, index=['first', 'second']) ```

	a	b	c
first	1	2	NaN
second	5	10	20

pd.DataFrame(data, columns=['a', 'b'])

	a	b
0	1	2
1	5	10

从元组字典创建

了解其创建的原理，实际应用中，会通过数据清洗的方式，把数据整理成方便 Pandas 导入且可读性好的格式。最后再通过 reindex/groupby 等方式转换成复杂数据结构。

d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}

# 多级标签
pd.DataFrame(d)

		a			b
		a	b	c	a	b
A	B	4	1	5	8	10
	C	3	2	6	7	NaN
	D	NaN	NaN	NaN	NaN	9

从 Series 创建

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)

	0
a	-0.789343
b	0.127384
c	1.084005
d	-0.755011
e	-0.963299

pd.DataFrame(s, index=['a', 'c', 'd'])

	0
a	-0.789343
c	1.084005
d	-0.755011

如果这里指定多列，就会报错，不同于字典作为参数进行创建

pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])

	A
a	-0.789343
c	1.084005
d	-0.755011

列选择/增加/删除

df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df

	one	two	three	four
0	2.045300	-0.981722	-0.656081	-0.639517
1	-0.550780	0.248781	-0.146424	0.217392
2	1.702775	0.103998	-0.662138	-0.534071
3	-2.035681	0.015025	1.368209	0.178378
4	-1.092208	0.091108	-0.892496	-0.611198
5	0.093502	0.267428	1.189654	-0.258723

df['one']

0    2.045300
1   -0.550780
2    1.702775
3   -2.035681
4   -1.092208
5    0.093502
Name: one, dtype: float64

df['three'] = df['one'] + df['two']
df

	one	two	three	four
0	2.045300	-0.981722	1.063578	-0.639517
1	-0.550780	0.248781	-0.301999	0.217392
2	1.702775	0.103998	1.806773	-0.534071
3	-2.035681	0.015025	-2.020656	0.178378
4	-1.092208	0.091108	-1.001100	-0.611198
5	0.093502	0.267428	0.360931	-0.258723

df['flag'] = df['one'] > 0
df

	one	two	three	four	flag
0	2.045300	-0.981722	1.063578	-0.639517	True
1	-0.550780	0.248781	-0.301999	0.217392	False
2	1.702775	0.103998	1.806773	-0.534071	True
3	-2.035681	0.015025	-2.020656	0.178378	False
4	-1.092208	0.091108	-1.001100	-0.611198	False
5	0.093502	0.267428	0.360931	-0.258723	True

del df['three']
df

	one	two	four	flag
0	2.045300	-0.981722	-0.639517	True
1	-0.550780	0.248781	0.217392	False
2	1.702775	0.103998	-0.534071	True
3	-2.035681	0.015025	0.178378	False
4	-1.092208	0.091108	-0.611198	False
5	0.093502	0.267428	-0.258723	True

four = df.pop('four')
four

0   -0.639517
1    0.217392
2   -0.534071
3    0.178378
4   -0.611198
5   -0.258723
Name: four, dtype: float64

df

	one	two	flag
0	2.045300	-0.981722	True
1	-0.550780	0.248781	False
2	1.702775	0.103998	True
3	-2.035681	0.015025	False
4	-1.092208	0.091108	False
5	0.093502	0.267428	True

df['five'] = 5
df

	one	two	flag	five
0	2.045300	-0.981722	True	5
1	-0.550780	0.248781	False	5
2	1.702775	0.103998	True	5
3	-2.035681	0.015025	False	5
4	-1.092208	0.091108	False	5
5	0.093502	0.267428	True	5

df['one_trunc'] = df['one'][:2]
df

	one	two	flag	five	one_trunc
0	2.045300	-0.981722	True	5	2.04530
1	-0.550780	0.248781	False	5	-0.55078
2	1.702775	0.103998	True	5	NaN
3	-2.035681	0.015025	False	5	NaN
4	-1.092208	0.091108	False	5	NaN
5	0.093502	0.267428	True	5	NaN

# 指定插入位置
df.insert(1, 'bar', df['one'])
df

	one	bar	two	flag	five	one_trunc
0	2.045300	2.045300	-0.981722	True	5	2.04530
1	-0.550780	-0.550780	0.248781	False	5	-0.55078
2	1.702775	1.702775	0.103998	True	5	NaN
3	-2.035681	-2.035681	0.015025	False	5	NaN
4	-1.092208	-1.092208	0.091108	False	5	NaN
5	0.093502	0.093502	0.267428	True	5	NaN

使用 assign() 方法来插入新列

更方便地使用 methd chains 的方法来实现

df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
df

	A	B	C	D
0	4	3	3	4
1	1	4	4	2
2	1	4	4	3
3	2	4	4	3
4	1	2	4	2
5	3	4	1	4

df.assign(Ratio = df['A'] / df['B'])

	A	B	C	D	Ratio
0	4	3	3	4	1.333333
1	1	4	4	2	0.250000
2	1	4	4	3	0.250000
3	2	4	4	3	0.500000
4	1	2	4	2	0.500000
5	3	4	1	4	0.750000

insert是直接改变原DataFrame，assign是生成一个新的，assign的另一个特性是可以传入函数进行计算。

df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)

	A	B	C	D	AB_Ratio	CD_Ratio
0	4	3	3	4	1.333333	-1
1	1	4	4	2	0.250000	2
2	1	4	4	3	0.250000	1
3	2	4	4	3	0.500000	1
4	1	2	4	2	0.500000	2
5	3	4	1	4	0.750000	-3

df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)

	A	B	C	D	AB_Ratio	ABD_Ratio
0	4	3	3	4	1.333333	5.333333
1	1	4	4	2	0.250000	0.500000
2	1	4	4	3	0.250000	0.750000
3	2	4	4	3	0.500000	1.500000
4	1	2	4	2	0.500000	1.000000
5	3	4	1	4	0.750000	3.000000

索引和选择

对应的操作，语法和返回结果

选择一列 -> df[col] -> Series
根据行标签选择一行 -> df.loc[label] -> Series
根据行位置选择一行 -> df.iloc[label] -> Series
选择多行 -> df[5:10] -> DataFrame
根据布尔向量选择多行 -> df[bool_vector] -> DataFrame

df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df

	A	B	C	D
a	2	2	6	6
b	8	3	5	7
c	4	6	8	3
d	7	8	3	9
e	8	4	4	2
f	4	2	4	3

df['A']

a    2
b    8
c    4
d    7
e    8
f    4
Name: A, dtype: int32

df.loc['a']

A    2
B    2
C    6
D    6
Name: a, dtype: int32

df.iloc[0]

A    2
B    2
C    6
D    6
Name: a, dtype: int32

df[1:4]

	A	B	C	D
b	8	3	5	7
c	4	6	8	3
d	7	8	3	9

使用iloc比这样直接访问效率要高df.iloc[1:4]

df[[False, True, True, False, True, False]]

	A	B	C	D
b	8	3	5	7
c	4	6	8	3
e	8	4	4	2

数据对齐

DataFrame 在进行数据计算时，会自动按行和列进行数据对齐。最终的计算结果会合并两个 DataFrame。

df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1

	A	B	C	D
a	0.576428	-0.037913	-0.329787	-1.752916
b	0.406743	-1.044561	-0.724447	0.374599
c	0.073578	0.423914	-1.499770	-0.488374
d	-0.377609	1.137422	-1.951169	-0.814306
e	-2.171648	-2.364502	-0.833594	0.168636
f	-1.134800	-0.927469	0.886889	0.542603
g	0.625104	0.115953	-1.282609	1.031292
h	0.403509	0.263207	0.403614	-0.177888
i	0.148494	-2.034253	0.134859	-0.960650
j	0.094200	-1.803288	0.057472	-0.338958

df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2

	A	B	C
c	0.884518	0.337344	-1.072027
d	0.264036	-0.152542	-0.225544
e	1.048813	-1.496442	1.022348
f	0.895314	-0.890236	1.230465
g	-0.588162	-0.492354	-0.739563
h	-2.580322	1.104810	-0.167137
i	-0.842738	0.171735	0.847714

df1 + df2

	A	B	C	D
a	NaN	NaN	NaN	NaN
b	NaN	NaN	NaN	NaN
c	0.958096	0.761259	-2.571797	NaN
d	-0.113573	0.984880	-2.176713	NaN
e	-1.122834	-3.860944	0.188754	NaN
f	-0.239486	-1.817705	2.117354	NaN
g	0.036942	-0.376401	-2.022171	NaN
h	-2.176813	1.368016	0.236476	NaN
i	-0.694245	-1.862517	0.982573	NaN
j	NaN	NaN	NaN	NaN

df1 - df1.iloc[0]

	A	B	C	D
a	0.000000	0.000000	0.000000	0.000000
b	-0.169685	-1.006648	-0.394660	2.127515
c	-0.502850	0.461827	-1.169983	1.264541
d	-0.954037	1.175335	-1.621382	0.938610
e	-2.748076	-2.326589	-0.503807	1.921551
f	-1.711228	-0.889556	1.216676	2.295518
g	0.048676	0.153866	-0.952822	2.784208
h	-0.172919	0.301119	0.733400	1.575028
i	-0.427934	-1.996340	0.464646	0.792265
j	-0.482228	-1.765375	0.387259	1.413957

这里两个运算对象不是同类型，用到广播原理，生成新的DataFrame，是原来DataFrame的每行减去第0行。

使用 numpy 函数

Pandas 与 numpy 在核心数据结构上是完全兼容的

df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df

	one	two	three	four
0	-1.121818	1.233686	0.681618	-0.502204
1	1.469664	-0.060555	-0.044857	0.725021
2	1.219670	0.108709	1.806063	0.332685
3	-0.190615	1.244102	-0.863850	1.795335
4	-0.133109	-0.101591	0.818724	1.246230
5	0.729804	0.716593	2.472841	-0.078224
6	0.010136	1.725441	-1.071194	1.602945
7	1.002507	-1.122593	-0.147411	-1.678843
8	-0.550077	0.230777	-0.658470	-1.680395
9	1.006271	0.455683	-2.279833	-0.823792

np.exp(df)

	one	two	three	four
0	0.325687	3.433864	1.977073	0.605196
1	4.347774	0.941242	0.956134	2.064774
2	3.386069	1.114838	6.086440	1.394708
3	0.826450	3.469817	0.421536	6.021490
4	0.875369	0.903399	2.267604	3.477210
5	2.074675	2.047446	11.856082	0.924757
6	1.010187	5.614995	0.342599	4.967641
7	2.725105	0.325435	0.862939	0.186590
8	0.576905	1.259578	0.517643	0.186300
9	2.735382	1.577250	0.102301	0.438765

np.asarray(df) == df.values

array([[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True]], dtype=bool)

type(np.asarray(df))

numpy.ndarray

np.asarray(df) == df

	one	two	three	four
0	True	True	True	True
1	True	True	True	True
2	True	True	True	True
3	True	True	True	True
4	True	True	True	True
5	True	True	True	True
6	True	True	True	True
7	True	True	True	True
8	True	True	True	True
9	True	True	True	True

Panel

Panel 是三维带标签的数组。实际上，Pandas 的名称由来就是由 Panel 演进的，即 pan(el)-da(ta)-s。Panel 比较少用，但依然是最重要的基础数据结构之一。

items: 坐标轴 0，索引对应的元素是一个 DataFrame
major_axis: 坐标轴 1, DataFrame 里的行标签
minor_axis: 坐标轴 2, DataFrame 里的列标签

data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),'Item2' : pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

pn['Item1']

	0	1	2
0	0.638298	-1.600822	3.112210
1	0.394099	0.184129	0.438450
2	0.427692	-0.294556	0.039430
3	1.555046	0.933749	0.218616

pn.items

Index([u'Item1', u'Item2'], dtype='object')

pn.major_axis

Int64Index([0, 1, 2, 3], dtype='int64')

pn.minor_axis

Int64Index([0, 1, 2], dtype='int64')

# 函数调用
pn.major_xs(pn.major_axis[0])

	Item1	Item2
0	0.638298	-1.427579
1	-1.600822	-0.778090
2	3.112210	NaN

# 函数调用
pn.minor_xs(pn.major_axis[1])

	Item1	Item2
0	-1.600822	-0.778090
1	0.184129	0.698347
2	-0.294556	-0.167423
3	0.933749	0.205092

pn.to_frame()

		Item1	Item2
major	minor
0	0	0.638298	-1.427579
0	1	-1.600822	-0.778090
1	0	0.394099	-0.999929
1	1	0.184129	0.698347
2	0	0.427692	0.559905
2	1	-0.294556	-0.167423
3	0	1.555046	-1.992102
3	1	0.933749	0.205092

自动去除了NaN一列

基础运算

import pandas as pd
import numpy as np

重新索引

Series

s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))
s

a    1
c    3
e    5
f    6
h    8
dtype: int64

s.reindex(list('abcdefgh'))

a     1
b   NaN
c     3
d   NaN
e     5
f     6
g   NaN
h     8
dtype: float64

s.reindex(list('abcdefgh'), fill_value=0)

a    1
b    0
c    3
d    0
e    5
f    6
g    0
h    8
dtype: int64

# method='bfill'
s.reindex(list('abcdefgh'), method='ffill')

a    1
b    1
c    3
d    3
e    5
f    6
g    6
h    8
dtype: int64

DataFrame

df = pd.DataFrame(np.random.randn(4, 6), index=list('ADFH'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df

	one	two	three	four	five	six
A	-0.049437	-0.526499	1.780662	1.154747	2.434957	-1.579278
D	-0.075226	0.552163	-0.462732	-0.936051	-0.590041	0.484505
F	1.486168	0.725907	0.598127	-0.704809	-2.815687	-0.062462
H	-0.900819	-0.177751	-0.232796	0.234088	-1.758574	1.255955

df2 = df.reindex(index=list('ABCDEFGH'))
df2

默认也可以不写index，但columns得写

	one	two	three	four	five	six
A	-0.049437	-0.526499	1.780662	1.154747	2.434957	-1.579278
B	NaN	NaN	NaN	NaN	NaN	NaN
C	NaN	NaN	NaN	NaN	NaN	NaN
D	-0.075226	0.552163	-0.462732	-0.936051	-0.590041	0.484505
E	NaN	NaN	NaN	NaN	NaN	NaN
F	1.486168	0.725907	0.598127	-0.704809	-2.815687	-0.062462
G	NaN	NaN	NaN	NaN	NaN	NaN
H	-0.900819	-0.177751	-0.232796	0.234088	-1.758574	1.255955

reindex是拷贝一份数据出来

df.reindex(columns=['one', 'three', 'five', 'seven'])

	one	three	five	seven
A	100.000000	1.780662	2.434957	NaN
D	-0.075226	-0.462732	-0.590041	NaN
F	1.486168	0.598127	-2.815687	NaN
H	-0.900819	-0.232796	-1.758574	NaN

df.reindex(columns=['one', 'three', 'five', 'seven'], fill_value=0)

	one	three	five
A	100.000000	1.780662	2.434957
D	-0.075226	-0.462732	-0.590041
F	1.486168	0.598127	-2.815687
H	-0.900819	-0.232796	-1.758574

# fill method 只对行有效
df.reindex(columns=['one', 'three', 'five', 'seven'], method='ffill')

	one	three	five	seven
A	100.000000	1.780662	2.434957	NaN
D	-0.075226	-0.462732	-0.590041	NaN
F	1.486168	0.598127	-2.815687	NaN
H	-0.900819	-0.232796	-1.758574	NaN

df.reindex(index=list('ABCDEFGH'), method='ffill')

	one	two	three	four	five	six
A	100.000000	-0.526499	1.780662	1.154747	2.434957	-1.579278
B	100.000000	-0.526499	1.780662	1.154747	2.434957	-1.579278
C	100.000000	-0.526499	1.780662	1.154747	2.434957	-1.579278
D	-0.075226	0.552163	-0.462732	-0.936051	-0.590041	0.484505
E	-0.075226	0.552163	-0.462732	-0.936051	-0.590041	0.484505
F	1.486168	0.725907	0.598127	-0.704809	-2.815687	-0.062462
G	1.486168	0.725907	0.598127	-0.704809	-2.815687	-0.062462
H	-0.900819	-0.177751	-0.232796	0.234088	-1.758574	1.255955

丢弃部分数据

df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df

	one	two	three	four	five	six
A	-0.665415	-0.061367	0.075058	0.626415	-1.748458	-0.608540
B	-1.455186	1.846691	0.234276	0.660298	-2.169835	-1.476485
C	0.322281	0.505378	0.198458	-0.831919	-0.630789	0.762524
D	0.703684	-0.827597	0.178063	0.108453	-0.418992	0.242912

df.drop('A')

	one	two	three	four	five	six
B	-1.455186	1.846691	0.234276	0.660298	-2.169835	-1.476485
C	0.322281	0.505378	0.198458	-0.831919	-0.630789	0.762524
D	0.703684	-0.827597	0.178063	0.108453	-0.418992	0.242912

df2 = df.drop(['two', 'four'], axis=1)
df2

	one	three	five	six
A	-0.665415	0.075058	-1.748458	-0.608540
B	-1.455186	0.234276	-2.169835	-1.476485
C	0.322281	0.198458	-0.630789	0.762524
D	0.703684	0.178063	-0.418992	0.242912

drop是是拷贝一份数据出来的

广播运算

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df

	A	B	C
one	0	1	2
two	3	4	5
three	6	7	8
four	9	10	11

df.loc['one']

A    0
B    1
C    2
Name: one, dtype: int64

df - df.loc['one']

	A	B	C
one	0	0	0
two	3	3	3
three	6	6	6
four	9	9	9

函数应用

apply: 将数据按行或列进行计算
applymap: 将数据按元素为进行计算

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df

	A	B	C
one	0	1	2
two	3	4	5
three	6	7	8
four	9	10	11

# 每一列作为一个 Series 作为参数传递给 lambda 函数
df.apply(lambda x: x.max() - x.min())

A    9
B    9
C    9
dtype: int64

# 每一行作为一个 Series 作为参数传递给 lambda 函数
df.apply(lambda x: x.max() - x.min(), axis=1)

one      2
two      2
three    2
four     2
dtype: int64

# 返回多个值组成的 Series
def min_max(x):return pd.Series([x.min(), x.max()], index=['min', 'max'])
df.apply(min_max, axis=1)

	min	max
one	0	2
two	3	5
three	6	8
four	9	11

# applymap: 逐元素运算
df = pd.DataFrame(np.random.randn(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df

	A	B	C
one	-1.126089	-0.286584	1.538841
two	1.804348	-0.709293	-0.400643
three	-1.008037	-0.791648	0.388505
four	-0.071827	0.659098	-0.505030

formater = '{0:.02f}'.format
# formater = lambda x: '%.02f' % x
df.applymap(formater)

0表示第0个参数，formater的写法也是一种函数

	A	B	C
one	-1.13	-0.29	1.54
two	1.80	-0.71	-0.40
three	-1.01	-0.79	0.39
four	-0.07	0.66	-0.51

排序和排名

df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), index=list('ABCD'), columns=['one', 'two', 'three'])
df

	one	two	three
A	7	8	5
B	2	6	4
C	4	5	4
D	6	2	1

df.sort_values(by='one')

	one	two	three
B	2	6	4
C	4	5	4
D	6	2	1
A	7	8	5

s = pd.Series([3, 6, 2, 6, 4])

s.rank()

0    2.0
1    4.5
2    1.0
3    4.5
4    3.0
dtype: float64

s.rank(method='first', ascending=False)

method='first’表示并列时，先出现排前面
默认是average，还有一种是last

0    4
1    1
2    5
3    2
4    3
dtype: float64

DataFrame默认是按照每列进行排名，与sort_values不同

数据唯一性及成员资格

适用于 Series

s = pd.Series(list('abbcdabacad'))
s

0     a
1     b
2     b
3     c
4     d
5     a
6     b
7     a
8     c
9     a
10    d
dtype: object

s.unique()

array(['a', 'b', 'c', 'd'], dtype=object)

s.value_counts()

a    4
b    3
d    2
c    2
dtype: int64

s.isin(['a', 'b', 'c'])

0      True
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8      True
9      True
10    False
dtype: bool