文章目录

  • pandas入门
    • 矩阵运算
    • 重建索引
    • 统计
    • 数据合并
    • 分组统计
    • 数据整形
    • 数据透视表
    • 时间序列
    • 类别数据
    • 画图
    • 数据读写
  • 电影数据分析
    • 准备工作
    • 数据说明
    • 利用 Pandas 分析电影评分数据
    • 数据合并
  • pandas核心数据结构
    • Series
      • 从 ndaray 创建
      • 从字典创建
      • 从标量创建
    • Series特性
      • Series 是类 ndarray 对象
      • Series 是类字典对象
      • 标签对齐操作
      • name 属性
    • DataFrame
      • 从结构化数据中创建
      • 从字典列表创建
      • 从元组字典创建
      • 从 Series 创建
      • 列选择/增加/删除
      • 使用 assign() 方法来插入新列
      • 索引和选择
      • 数据对齐
      • 使用 numpy 函数
    • Panel
  • 基础运算
  • 重新索引
    • Series
    • DataFrame
  • 丢弃部分数据
    • 广播运算
    • 函数应用
    • 排序和排名
    • 数据唯一性及成员资格

pandas入门

首先,下面所有的例子均需要导入包:

# 设置为 inline 风格
%matplotlib inline
# 包导入
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

一维数组:Series:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

输出:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

二维数组创建:

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

输出

A B C D
2016-03-01 -0.859805 -0.069692 -0.905092 -0.553213
2016-03-02 -0.353785 0.031793 -0.785213 -0.212337
2016-03-03 1.719976 0.925145 0.241639 -0.490166
2016-03-04 -1.207854 -0.001647 -0.468976 -0.781144
2016-03-05 0.452034 1.371208 1.152729 1.470498
2016-03-06 1.378227 0.246941 -1.186630 -0.411647
df.valuesarray([[-8.59804740e-01, -6.96922374e-02, -9.05091676e-01,-5.53212518e-01],[-3.53785450e-01,  3.17933613e-02, -7.85212585e-01,-2.12337229e-01],[ 1.71997643e+00,  9.25144720e-01,  2.41639347e-01,-4.90166361e-01],[-1.20785422e+00, -1.64720630e-03, -4.68976209e-01,-7.81144372e-01],[ 4.52033577e-01,  1.37120779e+00,  1.15272905e+00,1.47049771e+00],[ 1.37822701e+00,  2.46941166e-01, -1.18662963e+00,-4.11647030e-01]])

用字典来构建DataFrame,如下:

# 使用字典来创建:key 为 DataFrame 的列;value 为对应列下的值
df1 = pd.DataFrame({'A': 1,'B': pd.Timestamp('20160301'),'C': range(4),'D': np.arange(5, 9),'E': 'text','F': ['AA', 'BB', 'CC', 'DD']})
df1
A B C D E F
0 1 2016-03-01 0 5 text AA
1 1 2016-03-01 1 6 text BB
2 1 2016-03-01 2 7 text CC
3 1 2016-03-01 3 8 text DD

各个列可以通过DataFrame的属性来访问,如

df.A、df.B
type(df.A)
pandas.core.series.Series

可以看出,每一列或者每一行,都是Series数据

df.types  #查看每一列的数据类型
df.shape  #查看数据区的矩阵行列数
df.head() #查看矩阵的前五行,在函数内指定数字,可以查看指定行数
df.tail() #查看矩阵的后五行
df.index  #矩阵的行索引
df.columns#矩阵的列索引
df.values #查看矩阵的值,返回的是一个ndarray类型矩阵
df.describe()#查看矩阵的统计信息,默认是axis=0,及每一列的各行数据参与运算,生成每一列的统计信息,包括(每一列有多少值,平均值,标准差,最小值,四分位,中位数,四分之三分位,最大值),其中四分位的计算方法是减1计算法,具体可以查看百度百科
df.T      #矩阵的转置,或者使用transpose,类似numpy的方法
df.sort_index()#将行索引排序,参数axis可以指定为1,对列索引排序,ascending表示降序还是升序
df.sort_values(by='A')#依据每一列的值进行排序
df.['A']  #选择某列
df.A
df[2:4]   #选择行,数字也可以换成行索引,布尔值索引,但当行索引也是数字时,默认使用数字,使用索引的切片,是包含最后一个元素所在的行的
df.loc[]  #使用索引来查找数据
df.iloc[] #使用数字来查找数据
#例子
df.loc[:,['B','C']],也可以访问某个特定的值
#
df.at[,]     #根据行列标签(索引)选择数据
df.iat[,]    #根据数字选择数据df[df.A>0]   #根据布尔索引,找出A列大于0的行
df[df>0]     #非零的数据为NaN

矩阵运算

df['tag'] = ['a'] * 2 + ['b'] * 2 + ['c'] * 2
A B C D tag
2016-03-01 0.033228 2.307123 -0.585367 -1.671832 a
2016-03-02 -1.967299 0.727670 -0.190863 -0.163514 a
2016-03-03 0.065359 0.696804 -0.550040 0.717347 b
2016-03-04 0.234850 0.289520 -1.087173 1.534277 b
2016-03-05 -1.459620 1.040987 0.220130 -0.068131 c
2016-03-06 0.865402 2.650889 -0.015460 -0.111889 c
df[df.tag.isin(['a', 'c'])]
A B C D tag
2016-03-01 0.033228 2.307123 -0.585367 -1.671832 a
2016-03-02 -1.967299 0.727670 -0.190863 -0.163514 a
2016-03-05 -1.459620 1.040987 0.220130 -0.068131 c
2016-03-06 0.865402 2.650889 -0.015460 -0.111889 c

### 修改元素

s = pd.Series(np.arange(6), index=pd.date_range('20160301', periods=6))
df['E'] = s   # 修改一列
df.at[pd.Timestamp('20160301'), 'A'] = 0.4  #  修改某个元素
df.B = 200  #将一列的值都修改成一个值
df.iloc[:,2:5] = 1000 #修改子表

要结束网页的jupyter notebook,在控制台两次ctrl+c

重建索引

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(data=np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
A B C D
2016-03-01 -0.985666 0.240058 0.716721 0.352009
2016-03-02 -1.563644 0.091766 1.081764 0.951541
2016-03-03 0.279760 -0.316136 1.198073 -0.562947
2016-03-04 1.174777 -0.225305 -0.280256 -0.074768
2016-03-05 2.173366 0.907038 -1.104678 -0.921779
2016-03-06 0.200422 0.442619 1.970330 -0.609867
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
A B C D E
2016-03-01 -0.985666 0.240058 0.716721 0.352009 NaN
2016-03-02 -1.563644 0.091766 1.081764 0.951541 NaN
2016-03-03 0.279760 -0.316136 1.198073 -0.562947 NaN
2016-03-04 1.174777 -0.225305 -0.280256 -0.074768 NaN

### 处理丢失数据

df1.loc[dates[1:3], 'E'] = 1
df1
A B C D E
2016-03-01 -0.985666 0.240058 0.716721 0.352009 NaN
2016-03-02 -1.563644 0.091766 1.081764 0.951541 1
2016-03-03 0.279760 -0.316136 1.198073 -0.562947 1
2016-03-04 1.174777 -0.225305 -0.280256 -0.074768 NaN

此时的E列有两个空值

df1.dropna()  #去除了空值所在的行
df1.fillna(value=5) #将空值替换成5
pd.isnull(df1) #判断是否有空数据
pd.isnull(df1).any()#列中是否有空数据
pd.isnull(df1).any().any()#表中是否有空数据

统计

空数据是不参与统计计算的

df1.mean()  #各列的平均值
df.sum(axis='columns') #计算加和
df.cumsum() #计算累加值
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)  #shift表示数据段整体往后移动两个位置,前面空出来的位置为NaN
df.sub(s, axis='index')#每个列都减处理,控制行变成空值,一般都是对空值做预处理在减
df.apply(np.cumsum) #效果同df.cumsum(),注意这里的sumsum属于np模块,不是pd模块,apply是吧一个列作为参数传给函数,applymap是把每个元素做处理
df.apply(lambda x: x.max() - x.min()) #得出每一列最大值减最小值
s = pd.Series(np.random.randint(0, 7, size=10)) ¥不包括最后一个数,但python自带函数是包括的
s.value_counts() #统计各种数字的数量
s.mode()         #最多的数是哪些

数据合并

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
A B C D
0 1.098103 -0.843356 -0.379135 0.419353
1 -0.177702 -0.225926 -0.363542 -0.153022
2 1.938231 0.154881 0.291382 0.152774
3 -0.460645 -0.268697 -1.509469 0.698776
4 -0.397048 -0.958223 0.212833 -0.435485
5 0.525406 -0.177595 0.453216 -0.093792
6 0.531912 -0.832667 0.200721 0.943878
7 -0.740845 0.098634 0.274020 1.671997
8 2.182379 1.729010 1.306269 0.580677
9 -0.031538 0.159714 0.736667 -0.122326
df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]])
(df == df1).all().all()  #判断两个矩阵是否相等# SQL 样式的联合查询
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
# SELECT * FROM left INNER JOIN right ON left.key = right.key;
pd.merge(left, right, on='key')s = pd.Series(np.random.randint(1, 5, size=4), index=list('ABCD'))
df.append(s, ignore_index=True)

如果s有五列,在插入后,多出来新的一列,除了最后一行以外,其他都是NaN

分组统计

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],'C' : np.random.randn(8),'D' : np.random.randn(8)})
df
A B C D
0 foo one -0.580320 -1.460149
1 bar one 1.471201 -1.079598
2 foo two 0.094836 1.513204
3 bar three -1.498810 0.754968
4 foo two 0.180709 0.415266
5 bar two 0.358515 -0.341988
6 foo one -0.121082 -0.408148
7 foo three 0.404648 -0.320882
df.groupby('A').sum()
C D
A
bar 0.330906 -0.666618
foo -0.021208 -0.260709
df.groupby(['A', 'B']).sum()
C D
A B
bar one 1.471201 -1.079598
three -1.498810 0.754968
two 0.358515 -0.341988
foo one -0.701402 -1.868297
three 0.404648 -0.320882
two 0.275545 1.928470

形成了双索引的结构

数据整形

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
A B
first second
bar one 0.072026 0.422077
two -1.099181 -0.354796
baz one 1.285500 -1.185525
two 0.645316 -0.660115
foo one 0.696443 -1.664527
two 0.718399 -0.154125
qux one -0.740052 0.713089
two -0.672748 -1.346843
stacked = df.stack()
first  second
bar    one     A    0.072026B    0.422077two     A   -1.099181B   -0.354796
baz    one     A    1.285500B   -1.185525two     A    0.645316B   -0.660115
foo    one     A    0.696443B   -1.664527two     A    0.718399B   -0.154125
qux    one     A   -0.740052B    0.713089two     A   -0.672748B   -1.346843
dtype: float64
stacked.unstack()  #装换成以前的样子了
stacked.unstack().unstack()
A B
second one two one two
first
bar 0.072026 -1.099181 0.422077 -0.354796
baz 1.285500 0.645316 -1.185525 -0.660115
foo 0.696443 0.718399 -1.664527 -0.154125
qux -0.740052 -0.672748 0.713089 -1.346843
#可以指明选择哪一列作为列索引
stacked.unstack(1)
<div>
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th></th><th>second</th><th>one</th><th>two</th></tr><tr><th>first</th><th></th><th></th><th></th></tr></thead><tbody><tr><th rowspan="2" valign="top">bar</th><th>A</th><td>0.072026</td><td>-1.099181</td></tr><tr><th>B</th><td>0.422077</td><td>-0.354796</td></tr><tr><th rowspan="2" valign="top">baz</th><th>A</th><td>1.285500</td><td>0.645316</td></tr><tr><th>B</th><td>-1.185525</td><td>-0.660115</td></tr><tr><th rowspan="2" valign="top">foo</th><th>A</th><td>0.696443</td><td>0.718399</td></tr><tr><th>B</th><td>-1.664527</td><td>-0.154125</td></tr><tr><th rowspan="2" valign="top">qux</th><th>A</th><td>-0.740052</td><td>-0.672748</td></tr><tr><th>B</th><td>0.713089</td><td>-1.346843</td></tr></tbody>
</table>
</div>

数据透视表

数据透视是指查看里面的一部分数据

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,'B' : ['A', 'B', 'C'] * 4,'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,'D' : np.random.randn(12),'E' : np.random.randn(12)})
df
A B C D E
0 one A foo 1.477533 1.557713
1 one B foo 0.019528 2.483014
2 two C foo -0.912452 0.409732
3 three A bar 0.502807 -0.462401
4 one B bar 1.709597 -1.739413
5 one C bar -0.658155 1.302735
6 two A foo 0.007806 0.782926
7 three B foo -0.067922 -0.193820
8 one C foo 0.806713 0.383870
9 one A bar 0.794017 0.749756
10 two B bar -0.532554 -0.811900
11 three C bar 0.464731 1.168423
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A 0.794017 1.477533
B 1.709597 0.019528
C -0.658155 0.806713
three A 0.502807 NaN
B NaN -0.067922
C 0.464731 NaN
two A NaN 0.007806
B -0.532554 NaN
C NaN -0.912452

**当这个数据透视表对应多个值时,回去取平均值**

时间序列

rng = pd.date_range('20160301', periods=600, freq='s')
rng
DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01','2016-03-01 00:00:02', '2016-03-01 00:00:03','2016-03-01 00:00:04', '2016-03-01 00:00:05','2016-03-01 00:00:06', '2016-03-01 00:00:07','2016-03-01 00:00:08', '2016-03-01 00:00:09',...'2016-03-01 00:09:50', '2016-03-01 00:09:51','2016-03-01 00:09:52', '2016-03-01 00:09:53','2016-03-01 00:09:54', '2016-03-01 00:09:55','2016-03-01 00:09:56', '2016-03-01 00:09:57','2016-03-01 00:09:58', '2016-03-01 00:09:59'],dtype='datetime64[ns]', length=600, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2016-03-01 00:00:00     34
2016-03-01 00:00:01      4
2016-03-01 00:00:02    382
2016-03-01 00:00:03    164
2016-03-01 00:00:04    178
2016-03-01 00:00:05    421
2016-03-01 00:00:06     34
2016-03-01 00:00:07     71
2016-03-01 00:00:08    316
2016-03-01 00:00:09    201
2016-03-01 00:00:10    214
2016-03-01 00:00:11    443
2016-03-01 00:00:12    185
2016-03-01 00:00:13     79
2016-03-01 00:00:14     38
2016-03-01 00:00:15    465
2016-03-01 00:00:16    309
2016-03-01 00:00:17     93
2016-03-01 00:00:18     20
2016-03-01 00:00:19    338
2016-03-01 00:00:20    149
2016-03-01 00:00:21     34
2016-03-01 00:00:22    257
2016-03-01 00:00:23    462
2016-03-01 00:00:24     41
2016-03-01 00:00:25    471
2016-03-01 00:00:26    313
2016-03-01 00:00:27    224
2016-03-01 00:00:28     78
2016-03-01 00:00:29    498...
2016-03-01 00:09:30     61
2016-03-01 00:09:31    315
2016-03-01 00:09:32    388
2016-03-01 00:09:33    391
2016-03-01 00:09:34    263
2016-03-01 00:09:35     11
2016-03-01 00:09:36     61
2016-03-01 00:09:37    400
2016-03-01 00:09:38    109
2016-03-01 00:09:39    135
2016-03-01 00:09:40    267
2016-03-01 00:09:41    248
2016-03-01 00:09:42    469
2016-03-01 00:09:43    155
2016-03-01 00:09:44    284
2016-03-01 00:09:45    168
2016-03-01 00:09:46    228
2016-03-01 00:09:47    244
2016-03-01 00:09:48    442
2016-03-01 00:09:49    450
2016-03-01 00:09:50    226
2016-03-01 00:09:51    370
2016-03-01 00:09:52    192
2016-03-01 00:09:53    325
2016-03-01 00:09:54     82
2016-03-01 00:09:55    154
2016-03-01 00:09:56    285
2016-03-01 00:09:57     22
2016-03-01 00:09:58     48
2016-03-01 00:09:59    171
Freq: S, dtype: int32
#重新采样
ts.resample('2Min', how='sum')
2016-03-01 00:00:00    28595
2016-03-01 00:02:00    29339
2016-03-01 00:04:00    28991
2016-03-01 00:06:00    30789
2016-03-01 00:08:00    30131
Freq: 2T, dtype: int32
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2','1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4','1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2','1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4','1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2','1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4','1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2','2000Q3', '2000Q4'],dtype='int64', freq='Q-NOV')
prng.to_timestamp()   #转成标准格式数据 年-月-日
pd.Timestamp('20160301')-pd.Timestamp('2016') #计算时间差

类别数据

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
df["grade"] = df["raw_grade"].astype("category")
df
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 e e
df["grade"].cat.categories
Index([u'a', u'b', u'e'], dtype='object')
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
df.sort_values(by='grade', ascending=True)
id raw_grade grade
0 1 a very good
3 4 a very good
4 5 a very good
1 2 b good
2 3 b good
5 6 e very bad

画图

ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts
2000-01-01     0.416424
2000-01-02     0.603304
2000-01-03    -0.237965
2000-01-04     0.317450
2000-01-05     0.665045
2000-01-06     2.468087
2000-01-07     2.758852
2000-01-08     2.271343
2000-01-09     3.129609
2000-01-10     5.171241
2000-01-11     5.049896
2000-01-12     5.185316
2000-01-13     4.169058
2000-01-14     2.862306
2000-01-15     4.018617
2000-01-16     4.456694
2000-01-17     5.824236
2000-01-18     6.094983
2000-01-19     5.880954
2000-01-20     5.875111
2000-01-21     6.008481
2000-01-22     6.835501
2000-01-23     7.480405
2000-01-24     6.849335
2000-01-25     7.608887
2000-01-26     9.029474
2000-01-27     8.859222
2000-01-28     7.162806
2000-01-29     7.398013
2000-01-30     7.391844...
2002-08-28    21.728409
2002-08-29    21.757852
2002-08-30    21.047643
2002-08-31    20.114996
2002-09-01    18.769902
2002-09-02    17.417680
2002-09-03    17.917688
2002-09-04    18.064786
2002-09-05    19.312356
2002-09-06    18.633479
2002-09-07    17.711879
2002-09-08    19.162369
2002-09-09    19.697896
2002-09-10    18.895018
2002-09-11    18.590989
2002-09-12    17.278925
2002-09-13    17.730168
2002-09-14    19.058526
2002-09-15    18.898382
2002-09-16    17.048621
2002-09-17    16.443233
2002-09-18    16.842284
2002-09-19    14.627031
2002-09-20    15.500982
2002-09-21    14.640444
2002-09-22    13.183795
2002-09-23    13.383657
2002-09-24    13.006229
2002-09-25    12.311008
2002-09-26    11.674804
Freq: D, dtype: float64
ts.plot()

数据读写

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
A B C D
0 -1.052421 -0.164992 3.098604 -0.966960
1 1.194177 0.086880 0.496095 0.265308
2 0.297724 1.284297 -0.130855 -0.229570
3 -0.787063 0.553680 0.546853 -0.322599
4 0.033174 -1.222281 0.320090 -1.749333
5 0.109575 0.310684 1.620296 -0.928869
6 0.761408 -0.027630 0.458341 -0.785370
7 -1.150479 -0.718584 1.028866 0.419026
8 -2.906881 -0.295700 -0.342306 -0.765172
9 0.916363 -1.181429 -1.559657 -1.171191
10 0.578659 0.804726 1.299496 0.176843
11 0.150659 -0.162833 -1.086055 1.240432
12 -0.819219 1.668234 0.217604 -0.779170
13 -0.550658 -0.672640 -0.674157 -0.637602
14 0.901584 0.046023 0.244370 0.374293
15 0.971181 -0.442618 0.179083 0.086095
16 -0.570786 -1.019239 1.684833 0.539140
17 -1.432314 1.369588 2.091300 0.733526
18 -1.115526 -0.115884 2.636074 -0.788859
19 1.601554 1.226182 0.169308 -0.616585
20 0.571316 0.542432 0.306595 0.780939
21 -0.540414 1.036656 0.683224 -0.116963
22 1.319110 -1.265207 1.371924 0.881560
23 1.584346 -1.719633 -1.365020 -0.617224
24 -0.440420 -0.799265 0.376128 -0.654581
25 -0.261730 -0.046325 -0.289009 0.505634
26 0.385047 0.112723 0.428345 -0.008455
27 -0.921668 1.609848 1.592532 -0.623103
28 0.280799 -0.231821 -1.589829 -1.791286
29 0.661562 0.621305 0.921586 -0.312834
... ... ... ... ...
70 0.064385 0.669585 -1.347073 0.941348
71 -1.534420 -1.227736 0.459771 -1.150254
72 0.010741 0.062820 -1.098301 1.268482
73 -1.183586 1.159889 -0.186617 -0.847210
74 -0.705815 -0.371896 0.313020 0.035314
75 -2.945315 -0.421227 -0.403479 1.387825
76 -0.122383 0.474282 -2.039155 -0.155960
77 0.921353 -0.430436 -0.599253 0.911030
78 0.018444 0.098611 0.320480 0.001282
79 -0.188301 -2.015690 -0.427172 -0.146939
80 -0.006022 0.213421 1.358382 -0.414890
81 0.596546 0.042708 1.325342 -0.800222
82 -1.736245 -0.056213 -0.415892 -0.360570
83 0.463591 -0.404202 0.577191 0.336023
84 -1.397557 0.442012 0.007915 -1.305628
85 -0.137766 -0.771713 0.200956 -0.365344
86 0.988833 -0.165965 -0.893573 -0.318324
87 1.093799 1.694406 -0.868420 0.100202
88 -0.240628 0.539268 -1.094841 1.737569
89 1.850923 -0.472270 -2.317345 -0.544395
90 0.617284 1.224130 -1.722366 0.236574
91 1.282967 0.738570 1.748848 -0.106646
92 0.775707 -0.494293 -1.098466 0.372206
93 -0.846466 0.735144 1.456520 1.622817
94 -0.860999 1.146650 -1.064013 1.400919
95 -0.095498 -1.849518 2.303532 0.688425
96 -0.017921 -0.558700 -1.061605 0.781250
97 -1.069070 1.106837 -1.936800 -0.782616
98 0.436267 0.463537 0.614982 -0.123774
99 -1.440635 -1.506836 -0.386824 1.118260

100 rows × 4 columns

df.to_csv('data.csv')
%ls
%more data.csv
# pd.read_csv('data.csv')
pd.read_csv('data.csv', index_col=0)#以第一列作为索引

电影数据分析

准备工作

从网站 grouplens.org/datasets/movielens 下载 MovieLens 1M Dataset 数据。

数据说明

参阅数据介绍文件 README.txt

利用 Pandas 分析电影评分数据

  • 数据读取
  • 数据合并
  • 统计电影平均得分
  • 统计活跃电影 -> 获得评分的次数越多说明电影越活跃
  • 女生最喜欢的电影排行榜
  • 男生最喜欢的电影排行榜
  • 男女生评分差距最大的电影 -> 某类电影女生喜欢,但男生不喜欢
  • 最具争议的电影排行榜 -> 评分的方差最大
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=user_names, engine='python')rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')

pandas内部有两种实现方式,一种是C语言,一种是python,这里指定python,因为C语言功能不如python多。

print len(users)
users.head(5)
6040
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455

数据合并

data = pd.merge(pd.merge(users, ratings), movies)
len(data)
data.head(5)
user_id gender age occupation zip movie_id rating timestamp title genres
0 1 F 1 10 48067 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 M 56 16 70072 1193 5 978298413 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 M 25 12 32793 1193 4 978220179 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 M 25 7 22903 1193 4 978199279 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 M 50 1 95350 1193 5 978158471 One Flew Over the Cuckoo's Nest (1975) Drama
data[data.user_id == 1]  #查看你用户1对所有电影的评分情况
# 按性别查看各个电影的平均评分
mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings_gender.head(5)
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
# 男女意见想差最大的电影 -> 价值观/品味冲突
mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M
mean_ratings_gender.head(5)
gender F M diff
title
$1,000,000 Duck (1971) 3.375000 2.761905 0.613095
'Night Mother (1986) 3.388889 3.352941 0.035948
'Til There Was You (1997) 2.675676 2.733333 -0.057658
'burbs, The (1989) 2.793478 2.962085 -0.168607
...And Justice for All (1979) 3.828571 3.689024 0.139547
mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)
gender F M diff
title
Tigrero: A Film That Was Never Made (1994) 1 4.333333 -3.333333
Neon Bible, The (1995) 1 4.000000 -3.000000
Enfer, L' (1994) 1 3.750000 -2.750000
Stalingrad (1993) 1 3.593750 -2.593750
Killer: A Journal of Murder (1995) 1 3.428571 -2.428571
Dangerous Ground (1997) 1 3.333333 -2.333333
In God's Hands (1998) 1 3.333333 -2.333333
Rosie (1998) 1 3.333333 -2.333333
Flying Saucer, The (1950) 1 3.300000 -2.300000
Jamaica Inn (1939) 1 3.142857 -2.142857
# 活跃电影排行榜
ratings_by_movie_title = data.groupby('title').size()
ratings_by_movie_title.head(5)
title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64
# 前二十大高分电影 -> 平均评分最高的电影
mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
top_20_mean_ratings = mean_ratings.sort_values(ascending=False).head(20)
top_20_mean_ratings
title
Gate of Heavenly Peace, The (1995)                                     5.000000
Lured (1947)                                                           5.000000
Ulysses (Ulisse) (1954)                                                5.000000
Smashing Time (1967)                                                   5.000000
Follow the Bitch (1998)                                                5.000000
Song of Freedom (1936)                                                 5.000000
Bittersweet Motel (2000)                                               5.000000
Baby, The (1973)                                                       5.000000
One Little Indian (1973)                                               5.000000
Schlafes Bruder (Brother of Sleep) (1995)                              5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                    4.800000
Lamerica (1994)                                                        4.750000
Apple, The (Sib) (1998)                                                4.666667
Sanjuro (1962)                                                         4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)    4.560510
Shawshank Redemption, The (1994)                                       4.554558
Godfather, The (1972)                                                  4.524966
Close Shave, A (1995)                                                  4.520548
Usual Suspects, The (1995)                                             4.517106
Schindler's List (1993)                                                4.510417
Name: rating, dtype: float64
# 前十大热闹电影的平均评分 -> 不一定越热闹的电影,评分越高
mean_ratings[top_10_ratings.index]
title
American Beauty (1999)                                   4.317386
Star Wars: Episode IV - A New Hope (1977)                4.453694
Star Wars: Episode V - The Empire Strikes Back (1980)    4.292977
Star Wars: Episode VI - Return of the Jedi (1983)        4.022893
Jurassic Park (1993)                                     3.763847
Saving Private Ryan (1998)                               4.337354
Terminator 2: Judgment Day (1991)                        4.058513
Matrix, The (1999)                                       4.315830
Back to the Future (1985)                                3.990321
Silence of the Lambs, The (1991)                         4.351823
Name: rating, dtype: float64
# 前二十大高分电影的热闹程度 -> 不一定评分越高的电影越热闹,可能某个很小众的电影看得人少,但评分很高
ratings_by_movie_title[top_20_mean_ratings.index]
title
Gate of Heavenly Peace, The (1995)                                        3
Lured (1947)                                                              1
Ulysses (Ulisse) (1954)                                                   1
Smashing Time (1967)                                                      2
Follow the Bitch (1998)                                                   1
Song of Freedom (1936)                                                    1
Bittersweet Motel (2000)                                                  1
Baby, The (1973)                                                          1
One Little Indian (1973)                                                  1
Schlafes Bruder (Brother of Sleep) (1995)                                 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                       5
Lamerica (1994)                                                           8
Apple, The (Sib) (1998)                                                   9
Sanjuro (1962)                                                           69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)     628
Shawshank Redemption, The (1994)                                       2227
Godfather, The (1972)                                                  2223
Close Shave, A (1995)                                                   657
Usual Suspects, The (1995)                                             1783
Schindler's List (1993)                                                2304
dtype: int64
# 十大好电影 -> 活跃度超过 1000 的高分电影
top_10_movies = mean_ratings[top_ratings.index].sort_values(ascending=False).head(10)
top_10_movies
title
Shawshank Redemption, The (1994)                                               4.554558
Godfather, The (1972)                                                          4.524966
Usual Suspects, The (1995)                                                     4.517106
Schindler's List (1993)                                                        4.510417
Raiders of the Lost Ark (1981)                                                 4.477725
Rear Window (1954)                                                             4.476190
Star Wars: Episode IV - A New Hope (1977)                                      4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    4.449890
Casablanca (1942)                                                              4.412822
Sixth Sense, The (1999)                                                        4.406263
Name: rating, dtype: float64

pandas核心数据结构

import pandas as pd
import numpy as np

Series

Series 是一维带标签的数组,数组里可以放任意的数据(整数,浮点数,字符串,Python Object)。其基本的创建函数是:

s = pd.Series(data, index=index)

其中 index 是一个列表,用来作为数据的标签。data 可以是不同的数据类型:

  • Python 字典
  • ndarray 对象
  • 一个标量值,如 5

从 ndaray 创建

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a    0.747292
b   -1.120276
c   -0.132692
d   -0.267813
e   -0.590904
dtype: float64
s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
s = pd.Series(np.random.randn(5))
s
0    0.324214
1   -0.183776
2   -0.518808
3    0.866421
4   -0.601668
dtype: float64
s.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')

从字典创建

# 空值的默认处理
d = {'a' : 0., 'b' : 1., 'd' : 3}
s = pd.Series(d, index=list('abcd'))
s
a     0
b     1
c   NaN
d     3
dtype: float64

从标量创建

pd.Series(3, index=list('abcde'))
a    3
b    3
c    3
d    3
e    3
dtype: int64

Series特性

Series 是类 ndarray 对象

熟悉 numpy 的同学对下面的操作应该不会陌生。我们在 numpy 简介里也介绍过下面的索引方式。

s = pd.Series(np.random.randn(5))
s
0    0.882069
1   -0.134360
2   -0.925088
3    0.191072
4    2.546704
dtype: float64
s[0]
0.88206876023157332
s[:3]
0    0.882069
1   -0.134360
2   -0.925088
dtype: float64
s[[1, 3, 4]]
1   -0.134360
3    0.191072
4    2.546704
dtype: float64
np.exp(s)
0     2.415892
1     0.874275
2     0.396497
3     1.210546
4    12.764963
dtype: float64
np.sin(s)
0    0.772055
1   -0.133957
2   -0.798673
3    0.189911
4    0.560416
dtype: float64

Series 是类字典对象

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e   -0.462794
dtype: float64
s['a']
-2.1498403551053218
s['e'] = 5
s
a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e    5.000000
dtype: float64
s['g'] = 100
s
a     -2.149840
b     -0.924115
c      0.481231
d      1.033813
e      5.000000
g    100.000000
dtype: float64
'e' in s
True
'f' in s
False
# s['f']
print s.get('f')
None
print s.get('f', np.nan)
nan

标签对齐操作

s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print '{0}\n\n{1}'.format(s1, s2)
a   -0.917905
c   -0.744616
e    0.114522
dtype: float64a    0.721087
d   -0.471575
e    0.796093
dtype: float64
s1 + s2

name 属性

s = pd.Series(np.random.randn(5), name='Some Thing')
s
0    0.623787
1    0.517239
2    1.551314
3    1.414463
4   -1.224611
Name: Some Thing, dtype: float64
s.name
'Some Thing'a   -0.196818
c         NaN
d         NaN
e    0.910615
dtype: float64

DataFrame

DataFrame 是二维带行标签和列标签的数组。可以把 DataFrame 想你成一个 Excel 表格或一个 SQL 数据库的表格,还可以相像成是一个 Series 对象字典。它是 Pandas 里最常用的数据结构。

创建 DataFrame 的基本格式是:

df = pd.DataFrame(data, index=index, columns=columns)

其中 index 是行标签,columns 是列标签,data 可以是下面的数据:

  • 由一维 numpy 数组,list,Series 构成的字典
  • 二维 numpy 数组
  • 一个 Series
  • 另外的 DataFrame 对象
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
one two
a 1 1
b 2 2
c 3 3
d NaN 4
pd.DataFrame(d, index=['d', 'b', 'a'])
one two
d NaN 4
b 2 2
a 1 1
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
two three
d 4 NaN
b 2 NaN
a 1 NaN

如果是列表,必须元素个数一样,否则会报错

d = {'one' : [1, 2, 3, 4],'two' : [21, 22, 23, 24]}
pd.DataFrame(d)
one two
0 1 21
1 2 22
2 3 23
3 4 24
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
one two
a 1 21
b 2 22
c 3 23
d 4 24

从结构化数据中创建

data = [(1, 2.2, 'Hello'), (2, 3., "World")]
pd.DataFrame(data)
0 1 2
0 1 2.2 Hello
1 2 3.0 World
pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
A B C
first 1 2.2 Hello
second 2 3.0 World

从字典列表创建

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data)
a b c
0 1 2 NaN
1 5 10 20

```python pd.DataFrame(data, index=['first', 'second']) ```

a b c
first 1 2 NaN
second 5 10 20
pd.DataFrame(data, columns=['a', 'b'])
a b
0 1 2
1 5 10

从元组字典创建

了解其创建的原理,实际应用中,会通过数据清洗的方式,把数据整理成方便 Pandas 导入且可读性好的格式。最后再通过 reindex/groupby 等方式转换成复杂数据结构。

d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
# 多级标签
pd.DataFrame(d)
a b
a b c a b
A B 4 1 5 8 10
C 3 2 6 7 NaN
D NaN NaN NaN NaN 9

从 Series 创建

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)
0
a -0.789343
b 0.127384
c 1.084005
d -0.755011
e -0.963299
pd.DataFrame(s, index=['a', 'c', 'd'])
0
a -0.789343
c 1.084005
d -0.755011

如果这里指定多列,就会报错,不同于字典作为参数进行创建

pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])
A
a -0.789343
c 1.084005
d -0.755011

列选择/增加/删除

df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df
one two three four
0 2.045300 -0.981722 -0.656081 -0.639517
1 -0.550780 0.248781 -0.146424 0.217392
2 1.702775 0.103998 -0.662138 -0.534071
3 -2.035681 0.015025 1.368209 0.178378
4 -1.092208 0.091108 -0.892496 -0.611198
5 0.093502 0.267428 1.189654 -0.258723
df['one']
0    2.045300
1   -0.550780
2    1.702775
3   -2.035681
4   -1.092208
5    0.093502
Name: one, dtype: float64
df['three'] = df['one'] + df['two']
df
one two three four
0 2.045300 -0.981722 1.063578 -0.639517
1 -0.550780 0.248781 -0.301999 0.217392
2 1.702775 0.103998 1.806773 -0.534071
3 -2.035681 0.015025 -2.020656 0.178378
4 -1.092208 0.091108 -1.001100 -0.611198
5 0.093502 0.267428 0.360931 -0.258723
df['flag'] = df['one'] > 0
df
one two three four flag
0 2.045300 -0.981722 1.063578 -0.639517 True
1 -0.550780 0.248781 -0.301999 0.217392 False
2 1.702775 0.103998 1.806773 -0.534071 True
3 -2.035681 0.015025 -2.020656 0.178378 False
4 -1.092208 0.091108 -1.001100 -0.611198 False
5 0.093502 0.267428 0.360931 -0.258723 True
del df['three']
df
one two four flag
0 2.045300 -0.981722 -0.639517 True
1 -0.550780 0.248781 0.217392 False
2 1.702775 0.103998 -0.534071 True
3 -2.035681 0.015025 0.178378 False
4 -1.092208 0.091108 -0.611198 False
5 0.093502 0.267428 -0.258723 True
four = df.pop('four')
four
0   -0.639517
1    0.217392
2   -0.534071
3    0.178378
4   -0.611198
5   -0.258723
Name: four, dtype: float64
df
one two flag
0 2.045300 -0.981722 True
1 -0.550780 0.248781 False
2 1.702775 0.103998 True
3 -2.035681 0.015025 False
4 -1.092208 0.091108 False
5 0.093502 0.267428 True
df['five'] = 5
df
one two flag five
0 2.045300 -0.981722 True 5
1 -0.550780 0.248781 False 5
2 1.702775 0.103998 True 5
3 -2.035681 0.015025 False 5
4 -1.092208 0.091108 False 5
5 0.093502 0.267428 True 5
df['one_trunc'] = df['one'][:2]
df
one two flag five one_trunc
0 2.045300 -0.981722 True 5 2.04530
1 -0.550780 0.248781 False 5 -0.55078
2 1.702775 0.103998 True 5 NaN
3 -2.035681 0.015025 False 5 NaN
4 -1.092208 0.091108 False 5 NaN
5 0.093502 0.267428 True 5 NaN
# 指定插入位置
df.insert(1, 'bar', df['one'])
df
one bar two flag five one_trunc
0 2.045300 2.045300 -0.981722 True 5 2.04530
1 -0.550780 -0.550780 0.248781 False 5 -0.55078
2 1.702775 1.702775 0.103998 True 5 NaN
3 -2.035681 -2.035681 0.015025 False 5 NaN
4 -1.092208 -1.092208 0.091108 False 5 NaN
5 0.093502 0.093502 0.267428 True 5 NaN

使用 assign() 方法来插入新列

更方便地使用 methd chains 的方法来实现

df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
df
A B C D
0 4 3 3 4
1 1 4 4 2
2 1 4 4 3
3 2 4 4 3
4 1 2 4 2
5 3 4 1 4
df.assign(Ratio = df['A'] / df['B'])
A B C D Ratio
0 4 3 3 4 1.333333
1 1 4 4 2 0.250000
2 1 4 4 3 0.250000
3 2 4 4 3 0.500000
4 1 2 4 2 0.500000
5 3 4 1 4 0.750000

insert是直接改变原DataFrame,assign是生成一个新的,assign的另一个特性是可以传入函数进行计算。

df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)
A B C D AB_Ratio CD_Ratio
0 4 3 3 4 1.333333 -1
1 1 4 4 2 0.250000 2
2 1 4 4 3 0.250000 1
3 2 4 4 3 0.500000 1
4 1 2 4 2 0.500000 2
5 3 4 1 4 0.750000 -3
df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)
A B C D AB_Ratio ABD_Ratio
0 4 3 3 4 1.333333 5.333333
1 1 4 4 2 0.250000 0.500000
2 1 4 4 3 0.250000 0.750000
3 2 4 4 3 0.500000 1.500000
4 1 2 4 2 0.500000 1.000000
5 3 4 1 4 0.750000 3.000000

索引和选择

对应的操作,语法和返回结果

  • 选择一列 -> df[col] -> Series
  • 根据行标签选择一行 -> df.loc[label] -> Series
  • 根据行位置选择一行 -> df.iloc[label] -> Series
  • 选择多行 -> df[5:10] -> DataFrame
  • 根据布尔向量选择多行 -> df[bool_vector] -> DataFrame
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df
A B C D
a 2 2 6 6
b 8 3 5 7
c 4 6 8 3
d 7 8 3 9
e 8 4 4 2
f 4 2 4 3
df['A']
a    2
b    8
c    4
d    7
e    8
f    4
Name: A, dtype: int32
df.loc['a']
A    2
B    2
C    6
D    6
Name: a, dtype: int32
df.iloc[0]
A    2
B    2
C    6
D    6
Name: a, dtype: int32
df[1:4]
A B C D
b 8 3 5 7
c 4 6 8 3
d 7 8 3 9

使用iloc比这样直接访问效率要高df.iloc[1:4]

df[[False, True, True, False, True, False]]
A B C D
b 8 3 5 7
c 4 6 8 3
e 8 4 4 2

数据对齐

DataFrame 在进行数据计算时,会自动按行和列进行数据对齐。最终的计算结果会合并两个 DataFrame。

df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1
A B C D
a 0.576428 -0.037913 -0.329787 -1.752916
b 0.406743 -1.044561 -0.724447 0.374599
c 0.073578 0.423914 -1.499770 -0.488374
d -0.377609 1.137422 -1.951169 -0.814306
e -2.171648 -2.364502 -0.833594 0.168636
f -1.134800 -0.927469 0.886889 0.542603
g 0.625104 0.115953 -1.282609 1.031292
h 0.403509 0.263207 0.403614 -0.177888
i 0.148494 -2.034253 0.134859 -0.960650
j 0.094200 -1.803288 0.057472 -0.338958
df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2
A B C
c 0.884518 0.337344 -1.072027
d 0.264036 -0.152542 -0.225544
e 1.048813 -1.496442 1.022348
f 0.895314 -0.890236 1.230465
g -0.588162 -0.492354 -0.739563
h -2.580322 1.104810 -0.167137
i -0.842738 0.171735 0.847714
df1 + df2
A B C D
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c 0.958096 0.761259 -2.571797 NaN
d -0.113573 0.984880 -2.176713 NaN
e -1.122834 -3.860944 0.188754 NaN
f -0.239486 -1.817705 2.117354 NaN
g 0.036942 -0.376401 -2.022171 NaN
h -2.176813 1.368016 0.236476 NaN
i -0.694245 -1.862517 0.982573 NaN
j NaN NaN NaN NaN
df1 - df1.iloc[0]
A B C D
a 0.000000 0.000000 0.000000 0.000000
b -0.169685 -1.006648 -0.394660 2.127515
c -0.502850 0.461827 -1.169983 1.264541
d -0.954037 1.175335 -1.621382 0.938610
e -2.748076 -2.326589 -0.503807 1.921551
f -1.711228 -0.889556 1.216676 2.295518
g 0.048676 0.153866 -0.952822 2.784208
h -0.172919 0.301119 0.733400 1.575028
i -0.427934 -1.996340 0.464646 0.792265
j -0.482228 -1.765375 0.387259 1.413957

这里两个运算对象不是同类型,用到广播原理,生成新的DataFrame,是原来DataFrame的每行减去第0行。

使用 numpy 函数

Pandas 与 numpy 在核心数据结构上是完全兼容的

df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df
one two three four
0 -1.121818 1.233686 0.681618 -0.502204
1 1.469664 -0.060555 -0.044857 0.725021
2 1.219670 0.108709 1.806063 0.332685
3 -0.190615 1.244102 -0.863850 1.795335
4 -0.133109 -0.101591 0.818724 1.246230
5 0.729804 0.716593 2.472841 -0.078224
6 0.010136 1.725441 -1.071194 1.602945
7 1.002507 -1.122593 -0.147411 -1.678843
8 -0.550077 0.230777 -0.658470 -1.680395
9 1.006271 0.455683 -2.279833 -0.823792
np.exp(df)
one two three four
0 0.325687 3.433864 1.977073 0.605196
1 4.347774 0.941242 0.956134 2.064774
2 3.386069 1.114838 6.086440 1.394708
3 0.826450 3.469817 0.421536 6.021490
4 0.875369 0.903399 2.267604 3.477210
5 2.074675 2.047446 11.856082 0.924757
6 1.010187 5.614995 0.342599 4.967641
7 2.725105 0.325435 0.862939 0.186590
8 0.576905 1.259578 0.517643 0.186300
9 2.735382 1.577250 0.102301 0.438765
np.asarray(df) == df.values
array([[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True],[ True,  True,  True,  True]], dtype=bool)
type(np.asarray(df))
numpy.ndarray
np.asarray(df) == df
one two three four
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True

Panel

Panel 是三维带标签的数组。实际上,Pandas 的名称由来就是由 Panel 演进的,即 pan(el)-da(ta)-s。Panel 比较少用,但依然是最重要的基础数据结构之一。

  • items: 坐标轴 0,索引对应的元素是一个 DataFrame
  • major_axis: 坐标轴 1, DataFrame 里的行标签
  • minor_axis: 坐标轴 2, DataFrame 里的列标签
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),'Item2' : pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
pn['Item1']
0 1 2
0 0.638298 -1.600822 3.112210
1 0.394099 0.184129 0.438450
2 0.427692 -0.294556 0.039430
3 1.555046 0.933749 0.218616
pn.items
Index([u'Item1', u'Item2'], dtype='object')
pn.major_axis
Int64Index([0, 1, 2, 3], dtype='int64')
pn.minor_axis
Int64Index([0, 1, 2], dtype='int64')
# 函数调用
pn.major_xs(pn.major_axis[0])
Item1 Item2
0 0.638298 -1.427579
1 -1.600822 -0.778090
2 3.112210 NaN
# 函数调用
pn.minor_xs(pn.major_axis[1])
Item1 Item2
0 -1.600822 -0.778090
1 0.184129 0.698347
2 -0.294556 -0.167423
3 0.933749 0.205092
pn.to_frame()
Item1 Item2
major minor
0 0 0.638298 -1.427579
1 -1.600822 -0.778090
1 0 0.394099 -0.999929
1 0.184129 0.698347
2 0 0.427692 0.559905
1 -0.294556 -0.167423
3 0 1.555046 -1.992102
1 0.933749 0.205092

自动去除了NaN一列

基础运算

import pandas as pd
import numpy as np

重新索引

Series

s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))
s
a    1
c    3
e    5
f    6
h    8
dtype: int64
s.reindex(list('abcdefgh'))
a     1
b   NaN
c     3
d   NaN
e     5
f     6
g   NaN
h     8
dtype: float64
s.reindex(list('abcdefgh'), fill_value=0)
a    1
b    0
c    3
d    0
e    5
f    6
g    0
h    8
dtype: int64
# method='bfill'
s.reindex(list('abcdefgh'), method='ffill')
a    1
b    1
c    3
d    3
e    5
f    6
g    6
h    8
dtype: int64

DataFrame

df = pd.DataFrame(np.random.randn(4, 6), index=list('ADFH'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df
one two three four five six
A -0.049437 -0.526499 1.780662 1.154747 2.434957 -1.579278
D -0.075226 0.552163 -0.462732 -0.936051 -0.590041 0.484505
F 1.486168 0.725907 0.598127 -0.704809 -2.815687 -0.062462
H -0.900819 -0.177751 -0.232796 0.234088 -1.758574 1.255955
df2 = df.reindex(index=list('ABCDEFGH'))
df2

默认也可以不写index,但columns得写

one two three four five six
A -0.049437 -0.526499 1.780662 1.154747 2.434957 -1.579278
B NaN NaN NaN NaN NaN NaN
C NaN NaN NaN NaN NaN NaN
D -0.075226 0.552163 -0.462732 -0.936051 -0.590041 0.484505
E NaN NaN NaN NaN NaN NaN
F 1.486168 0.725907 0.598127 -0.704809 -2.815687 -0.062462
G NaN NaN NaN NaN NaN NaN
H -0.900819 -0.177751 -0.232796 0.234088 -1.758574 1.255955

reindex是拷贝一份数据出来

df.reindex(columns=['one', 'three', 'five', 'seven'])
one three five seven
A 100.000000 1.780662 2.434957 NaN
D -0.075226 -0.462732 -0.590041 NaN
F 1.486168 0.598127 -2.815687 NaN
H -0.900819 -0.232796 -1.758574 NaN
df.reindex(columns=['one', 'three', 'five', 'seven'], fill_value=0)
one three five seven
A 100.000000 1.780662 2.434957 0
D -0.075226 -0.462732 -0.590041 0
F 1.486168 0.598127 -2.815687 0
H -0.900819 -0.232796 -1.758574 0
# fill method 只对行有效
df.reindex(columns=['one', 'three', 'five', 'seven'], method='ffill')
one three five seven
A 100.000000 1.780662 2.434957 NaN
D -0.075226 -0.462732 -0.590041 NaN
F 1.486168 0.598127 -2.815687 NaN
H -0.900819 -0.232796 -1.758574 NaN
df.reindex(index=list('ABCDEFGH'), method='ffill')
one two three four five six
A 100.000000 -0.526499 1.780662 1.154747 2.434957 -1.579278
B 100.000000 -0.526499 1.780662 1.154747 2.434957 -1.579278
C 100.000000 -0.526499 1.780662 1.154747 2.434957 -1.579278
D -0.075226 0.552163 -0.462732 -0.936051 -0.590041 0.484505
E -0.075226 0.552163 -0.462732 -0.936051 -0.590041 0.484505
F 1.486168 0.725907 0.598127 -0.704809 -2.815687 -0.062462
G 1.486168 0.725907 0.598127 -0.704809 -2.815687 -0.062462
H -0.900819 -0.177751 -0.232796 0.234088 -1.758574 1.255955

丢弃部分数据

df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df
one two three four five six
A -0.665415 -0.061367 0.075058 0.626415 -1.748458 -0.608540
B -1.455186 1.846691 0.234276 0.660298 -2.169835 -1.476485
C 0.322281 0.505378 0.198458 -0.831919 -0.630789 0.762524
D 0.703684 -0.827597 0.178063 0.108453 -0.418992 0.242912
df.drop('A')
one two three four five six
B -1.455186 1.846691 0.234276 0.660298 -2.169835 -1.476485
C 0.322281 0.505378 0.198458 -0.831919 -0.630789 0.762524
D 0.703684 -0.827597 0.178063 0.108453 -0.418992 0.242912
df2 = df.drop(['two', 'four'], axis=1)
df2
one three five six
A -0.665415 0.075058 -1.748458 -0.608540
B -1.455186 0.234276 -2.169835 -1.476485
C 0.322281 0.198458 -0.630789 0.762524
D 0.703684 0.178063 -0.418992 0.242912

drop是是拷贝一份数据出来的

广播运算

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
A B C
one 0 1 2
two 3 4 5
three 6 7 8
four 9 10 11
df.loc['one']
A    0
B    1
C    2
Name: one, dtype: int64
df - df.loc['one']
A B C
one 0 0 0
two 3 3 3
three 6 6 6
four 9 9 9

函数应用

  • apply: 将数据按行或列进行计算
  • applymap: 将数据按元素为进行计算
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
A B C
one 0 1 2
two 3 4 5
three 6 7 8
four 9 10 11
# 每一列作为一个 Series 作为参数传递给 lambda 函数
df.apply(lambda x: x.max() - x.min())
A    9
B    9
C    9
dtype: int64
# 每一行作为一个 Series 作为参数传递给 lambda 函数
df.apply(lambda x: x.max() - x.min(), axis=1)
one      2
two      2
three    2
four     2
dtype: int64
# 返回多个值组成的 Series
def min_max(x):return pd.Series([x.min(), x.max()], index=['min', 'max'])
df.apply(min_max, axis=1)
min max
one 0 2
two 3 5
three 6 8
four 9 11
# applymap: 逐元素运算
df = pd.DataFrame(np.random.randn(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
A B C
one -1.126089 -0.286584 1.538841
two 1.804348 -0.709293 -0.400643
three -1.008037 -0.791648 0.388505
four -0.071827 0.659098 -0.505030
formater = '{0:.02f}'.format
# formater = lambda x: '%.02f' % x
df.applymap(formater)

0表示第0个参数,formater的写法也是一种函数

A B C
one -1.13 -0.29 1.54
two 1.80 -0.71 -0.40
three -1.01 -0.79 0.39
four -0.07 0.66 -0.51

排序和排名

df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), index=list('ABCD'), columns=['one', 'two', 'three'])
df
one two three
A 7 8 5
B 2 6 4
C 4 5 4
D 6 2 1
df.sort_values(by='one')
one two three
B 2 6 4
C 4 5 4
D 6 2 1
A 7 8 5
s = pd.Series([3, 6, 2, 6, 4])
s.rank()
0    2.0
1    4.5
2    1.0
3    4.5
4    3.0
dtype: float64
s.rank(method='first', ascending=False)

method='first’表示并列时,先出现排前面
默认是average,还有一种是last

0    4
1    1
2    5
3    2
4    3
dtype: float64

DataFrame默认是按照每列进行排名,与sort_values不同

数据唯一性及成员资格

适用于 Series

s = pd.Series(list('abbcdabacad'))
s
0     a
1     b
2     b
3     c
4     d
5     a
6     b
7     a
8     c
9     a
10    d
dtype: object
s.unique()
array(['a', 'b', 'c', 'd'], dtype=object)
s.value_counts()
a    4
b    3
d    2
c    2
dtype: int64
s.isin(['a', 'b', 'c'])
0      True
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8      True
9      True
10    False
dtype: bool

python数据科学包第二天相关推荐

  1. python 数据科学 包_什么时候应该使用哪个Python数据科学软件包?

    python 数据科学 包 Python is the most popular language for data science. Unfortunately, it can be tricky ...

  2. 看看这些鲜为人知的宝藏Python数据科学包吧!

    动态数据科学的这三剑客几乎无人不知无人不晓:Numpy,Pandas和Matplotlib.你可能已经熟悉这些包以及它们的运作方式. 还有其他很炫酷的包,你肯定也想试一试,例如Plotly,Seabo ...

  3. python数据科学包第三天(索引、分组计算、数据聚合、分组运算和转换、载入数据、日期范围、数据可视化)

    索引 行索引 列索引 索引的分类 重复索引的处理 s = pd.Series(np.random.rand(5), index=list('abcde')) s a 0.566924 b 0.6034 ...

  4. Python数据科学包(六)-----数据可视化和例子

    文章目录 一. 数据可视化 1. 线型图 2. 柱状图 3. 直方图 4. 密度图 5. 散布图 6. 饼图 7. 高级绘图 二. 股票数据分析 1. 分析波动幅度 2. 增长曲线 3. 增长倍数 4 ...

  5. python数据科学包(七)—— matplotlib实战之绘制球员能力图和股票K线图

    1.球员能力图 # -*- coding: utf-8 -*- import numpy as np import pandas as pd import matplotlib.pyplot as p ...

  6. 使用python构建向量空间_使用Docker构建Python数据科学容器

    人工智能(AI)和机器学习(ML)最近真的火了,并驱动了从自动驾驶汽车到药物发现等等应用领域的快速发展.AI和ML的前途一片光明. 另一方面,Docker通过引入临时轻量级容器彻底改变了计算世界.通过 ...

  7. python第二阶段(2)入门-数据科学包 pandas

    数据科学包 pandas 导入pandas 创建对象 1 系列 2 日期序列(1) 3 日期序列(2) 4 Series的操作(1) 5 Series的操作(2) 合并,新增,连接和比较 1 连接 2 ...

  8. python数据科学手册_小白入门Python数据科学

    前言 本文讲解了从零开始学习Python数据科学的全过程,涵盖各种工具和方法 你将会学习到如何使用python做基本的数据分析 你还可以了解机器学习算法的原理和使用 说明 先说一段题外话.我是一名数据 ...

  9. python 数据科学书籍_您必须在2020年阅读的数据科学书籍

    python 数据科学书籍 "We're entering a new world in which data may be more important than software.&qu ...

最新文章

  1. python移除链表元素
  2. ABAP-AVL-OO方法中的ALV的如何自己添加按钮及其响应
  3. Thinkpad在Windows8上热键的解决方案
  4. 【转】libpcap实现机制及接口函数
  5. Spring Boot笔记-解决前后端分离在开发时的跨域问题
  6. 设计模式(八)组合模式 Composite
  7. Python的实例方法,类方法,静态方法之间的区别及调用关系
  8. 阿里开源 iOS 协程开发框架 coobjc源码分析
  9. 13-微信小程序商城 产品简介布局(微信小程序商城开发、小程序毕业设计、小程序源代码)(黄菊华-微信小程序开发教程)
  10. 详述RFID服装智能管理方案
  11. 在上海、苏州、深圳、长沙从“蜗居”到“安家”,8090后要付出多少?
  12. 普度大学计算机科学博士,普渡大学计算机系 Yongle Zhang课题组招收全奖博士生...
  13. pytorch实现bert_精细调整bert和roberta以在pytorch中实现高精度文本分类
  14. Word 题注重新编号
  15. ssh远程登录命令简单实例
  16. 国一大佬也收藏的16个Python数据可视化案例(附源码)
  17. 035 导数 微分对应表
  18. 【C/C++】标准库之 numeric
  19. 计算机bios设置方法,bios功能怎么设置_bios设置图解教程
  20. MATLAB提取txt文本文档中特定关键字后的数字信息

热门文章

  1. 星形和雪花模型_星型模型和雪花型模型比较
  2. Vue跨域配置proxyTable中pathRewrite用法
  3. 订单服务-----功能实现逻辑
  4. DuerOS技能交互
  5. C/C++语言实现一百个数组之间求任意数组的交集和并集
  6. OA行业分析:大型企业如何选型OA系统?
  7. CentOS 关闭virbr0
  8. Clouder—构建企业级数据分析平台-墨羽@袋鼠云
  9. iMovie使用技巧
  10. 01-----Ubuntu16.04安装Gnome桌面环境