这里是用jupyter notebook写的pandas的基本用法,使用的环境是python3+windows,代码上传到csdn资源啦:ABC of Pandas

关于pandas学习还是强烈建议常去官方http://pandas.pydata.org/pandas-docs/stable/里查一查各种用法和toturial等。
下面是jupyter notebook代码导出的md文件。

数据结构 Series

from pandas import Seriesprint('用数组生成Series')
obj = Series([4, 7, -5, 3])
print(obj)
print(obj.values)
print(obj.index)
print()print('指定Series的index')
obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.index)
print(obj2['a'])
obj2['d'] = 6
print(obj2[['c', 'a', 'd']])
print(obj2[obj2 > 0])  # 找出大于0的元素
print('b' in obj2) # 判断索引是否存在
print('e' in obj2)
print()print('使用字典生成Series')
sdata = {'Ohio':45000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = Series(sdata)
print(obj3)
print()print('使用字典生成Series,并额外指定index,不匹配部分为NaN。')
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index = states)
print(obj4)
print()print('Series相加,相同索引部分相加。')
print(obj3 + obj4)
print()print('指定Series及其索引的名字')
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)
print()print('替换index')
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)
用数组生成Series
0    4
1    7
2   -5
3    3
dtype: int64
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)指定Series的index
d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')
-5
c    3
a   -5
d    6
dtype: int64
d    6
b    7
c    3
dtype: int64
True
False使用字典生成Series
Ohio      45000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64使用字典生成Series,并额外指定index,不匹配部分为NaN。
California        NaN
Ohio          45000.0
Oregon        16000.0
Texas         71000.0
dtype: float64Series相加,相同索引部分相加。
California         NaN
Ohio           90000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64指定Series及其索引的名字
state
California        NaN
Ohio          45000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64替换index
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

数据结构 DataFrame

import numpy as np
from pandas import Series, DataFrameprint('用字典生成DataFrame,key为列的名字。')
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year':[2000, 2001, 2002, 2001, 2002],'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
print(DataFrame(data))
print(DataFrame(data, columns = ['year', 'state', 'pop'])) # 指定列顺序
print()print('指定索引,在列中指定不存在的列,默认数据用NaN。')
frame2 = DataFrame(data,columns = ['year', 'state', 'pop', 'debt'],index = ['one', 'two', 'three', 'four', 'five'])
print(frame2)
print(frame2['state'])
print(frame2.year)
print(frame2.ix['three'])
frame2['debt'] = 16.5 # 修改一整列
print(frame2)
frame2.debt = np.arange(5)  # 用numpy数组修改元素
print(frame2)
print()print('用Series指定要修改的索引及其对应的值,没有指定的默认数据用NaN。')
val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)
print()print('赋值给新列')
frame2['eastern'] = (frame2.state == 'Ohio')  # 如果state等于Ohio为True
print(frame2)
print(frame2.columns)
print()print('DataFrame转置')
pop = {'Nevada':{2001:2.4, 2002:2.9},'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)
print(frame3)
print(frame3.T)
print()print('指定索引顺序,以及使用切片初始化数据。')
print(DataFrame(pop, index = [2001, 2002, 2003]))
pdata = {'Ohio':frame3['Ohio'][:-1], 'Nevada':frame3['Nevada'][:2]}
print(DataFrame(pdata))
print()print('指定索引和列的名称')
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
print(frame3.values)
print(frame2.values)
用字典生成DataFrame,key为列的名字。pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9指定索引,在列中指定不存在的列,默认数据用NaN。year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: objectyear   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4用Series指定要修改的索引及其对应的值,没有指定的默认数据用NaN。year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7赋值给新列year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')DataFrame转置Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.62000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6指定索引顺序,以及使用切片初始化数据。Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaNNevada  Ohio
2000     NaN   1.5
2001     2.4   1.7指定索引和列的名称
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
[[ nan  1.5][ 2.4  1.7][ 2.9  3.6]]
[[2000 'Ohio' 1.5 nan True][2001 'Ohio' 1.7 -1.2 True][2002 'Ohio' 3.6 nan True][2001 'Nevada' 2.4 -1.5 False][2002 'Nevada' 2.9 -1.7 False]]

数据结构 索引对象

import numpy as np
import pandas as pd
import sys
from pandas import Series, DataFrame, Indexprint('获取index')
obj = Series(range(3), index = ['a', 'b', 'c'])
index = obj.index
print(index[1:])
try:index[1] = 'd'  # index对象read only
except:print(sys.exc_info()[0])
print()print('使用Index对象')
index = Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
print(obj2)
print(obj2.index is index)
print()print('判断列和索引是否存在')
pop = {'Nevada':{20001:2.4, 2002:2.9},'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)
print('Ohio' in frame3.columns)
print('2003' in frame3.index)
获取index
Index(['b', 'c'], dtype='object')
<class 'TypeError'>使用Index对象
0    1.5
1   -2.5
2    0.0
dtype: float64
True判断列和索引是否存在
True
False

基本功能 重新索引

import numpy as np
from pandas import DataFrame, Seriesprint('重新指定索引及顺序')
obj = Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
print(obj)
obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])
print(obj2)
print(obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value = 0))  # 指定不存在元素的默认值
print()print('重新指定索引并指定填元素充方法')
obj3 = Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])
print(obj3)
print(obj3.reindex(range(6), method = 'ffill'))
print()print('对DataFrame重新指定索引')
frame = DataFrame(np.arange(9).reshape(3, 3),index = ['a', 'c', 'd'],columns = ['Ohio', 'Texas', 'California'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
print()print('重新指定column')
states = ['Texas', 'Utah', 'California']
print(frame.reindex(columns = states))
print()print('对DataFrame重新指定索引并指定填元素充方法')
print(frame.reindex(index = ['a', 'b', 'c', 'd'],method = 'ffill',columns = states))
print(frame.ix[['a', 'b', 'd', 'c'], states])
重新指定索引及顺序
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
d    4.5
c    3.6
e    NaN
dtype: float64
a   -5.3
b    7.2
d    4.5
c    3.6
e    0.0
dtype: float64重新指定索引并指定填元素充方法
0      blue
2    purple
4    yellow
dtype: object
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object对DataFrame重新指定索引Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0重新指定columnTexas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8对DataFrame重新指定索引并指定填元素充方法Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
d    7.0   NaN         8.0
c    4.0   NaN         5.0

基本功能 丢弃指定轴上的项

import numpy as np
from pandas import Series, DataFrameprint('Series根据索引删除元素')
obj = Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
print(new_obj)
print(obj.drop(['d', 'c']))
print()print('DataFrame删除元素,可指定索引或列。')
data = DataFrame(np.arange(16).reshape((4, 4)),index = ['Ohio', 'Colorado', 'Utah', 'New York'],columns = ['one', 'two', 'three', 'four'])
print(data)
print(data.drop(['Colorado', 'Ohio']))
print(data.drop('two', axis = 1))
print(data.drop(['two', 'four'], axis = 1))
Series根据索引删除元素
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
e    4.0
dtype: float64DataFrame删除元素,可指定索引或列。one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

基本功能 索引、选取和过滤

import numpy as np
from pandas import Series, DataFrameprint('Series的索引,默认数字索引可以工作。')
obj = Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])
print(obj['b'])
print(obj[3])
print(obj[[1, 3]])
print(obj[obj < 2])
print()print('Series的数组切片')
print(obj['b':'c'])  # 闭区间
obj['b':'c'] = 5
print(obj)
print()print('DataFrame的索引')
data = DataFrame(np.arange(16).reshape((4, 4)),index = ['Ohio', 'Colorado', 'Utah', 'New York'],columns = ['one', 'two', 'three', 'four'])
print(data)
print(data['two']) # 打印列
print(data[['three', 'one']])
print(data[:2])
print(data.ix['Colorado', ['two', 'three']]) # 指定索引和列
print(data.ix[['Colorado', 'Utah'], [3, 0, 1]])
print(data.ix[2])  # 打印第2行(从0开始)
print(data.ix[:'Utah', 'two']) # 从开始到Utah,第2列。
print()print('根据条件选择')
print(data[data.three > 5])
print(data < 5)  # 打印True或者False
data[data < 5] = 0
print(data)
Series的索引,默认数字索引可以工作。
1.0
3.0
b    1.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64Series的数组切片
b    1.0
c    2.0
dtype: float64
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64DataFrame的索引one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
two      5
three    6
Name: Colorado, dtype: int32four  one  two
Colorado     7    4    5
Utah        11    8    9
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32根据条件选择one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  Falseone  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

基本功能 算数运算和数据对齐

import numpy as np
from pandas import Series, DataFrameprint('加法')
s1 = Series([7.3, -2.5, 3.4, 1.5], index = ['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a', 'c', 'e', 'f', 'g'])
print(s1)
print(s2)
print(s1 + s2)
print()print('DataFrame加法,索引和列都必须匹配。')
df1 = DataFrame(np.arange(9.).reshape((3, 3)),columns = list('bcd'),index = ['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12).reshape((4, 3)),columns = list('bde'),index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)
print(df1 + df2)
print()print('数据填充')
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns = list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns = list('abcde'))
print(df1)
print(df2)
print(df1.add(df2, fill_value = 0))
print(df1.reindex(columns = df2.columns, fill_value = 0))
print()print('DataFrame与Series之间的操作')
arr = np.arange(12.).reshape((3, 4))
print(arr)
print(arr[0])
print(arr - arr[0])
frame = DataFrame(np.arange(12).reshape((4, 3)),columns = list('bde'),index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
print(frame)
print(series)
print(frame - series)
series2 = Series(range(3), index = list('bef'))
print(frame + series2)
series3 = frame['d']
print(frame.sub(series3, axis = 0))  # 按列减
加法
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64DataFrame加法,索引和列都必须匹配。b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN数据填充a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0DataFrame与Series之间的操作
[[  0.   1.   2.   3.][  4.   5.   6.   7.][  8.   9.  10.  11.]]
[ 0.  1.  2.  3.]
[[ 0.  0.  0.  0.][ 4.  4.  4.  4.][ 8.  8.  8.  8.]]b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
b    0
d    1
e    2
Name: Utah, dtype: int32b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaNb  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1

基本功能 函数应用和映射

应用示例中有“取出最大值和最小值”和“所有数据保留为两位数”,写的挺好

import numpy as np
from pandas import Series, DataFrameprint('函数')
frame = DataFrame(np.random.randn(4, 3),columns = list('bde'),index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print(np.abs(frame))
print()print('lambda以及应用')
f = lambda x: x.max() - x.min()
print(frame.apply(f))
print(frame.apply(f, axis = 1))
def f(x):return Series([x.min(), x.max()], index = ['min', 'max'])
print(frame.apply(f))
print()print('applymap和map')
_format = lambda x: '%.2f' % x
print(frame.applymap(_format))
print(frame['e'].map(_format))
函数b         d         e
Utah   -1.902941  2.046163  0.685950
Ohio   -0.377962  1.771554 -0.885729
Texas  -0.247769 -0.918187 -3.228982
Oregon -0.039260  2.129338 -0.077359b         d         e
Utah    1.902941  2.046163  0.685950
Ohio    0.377962  1.771554  0.885729
Texas   0.247769  0.918187  3.228982
Oregon  0.039260  2.129338  0.077359lambda以及应用
b    1.863681
d    3.047524
e    3.914932
dtype: float64
Utah      3.949104
Ohio      2.657283
Texas     2.981213
Oregon    2.206697
dtype: float64b         d         e
min -1.902941 -0.918187 -3.228982
max -0.039260  2.129338  0.685950applymap和mapb      d      e
Utah    -1.90   2.05   0.69
Ohio    -0.38   1.77  -0.89
Texas   -0.25  -0.92  -3.23
Oregon  -0.04   2.13  -0.08
Utah       0.69
Ohio      -0.89
Texas     -3.23
Oregon    -0.08
Name: e, dtype: object

基本功能 排序和排名

import numpy as np
from pandas import Series, DataFrameprint('根据索引排序,对于DataFrame可以指定轴。')
obj = Series(range(4), index = ['d', 'a', 'b', 'c'])
print(obj.sort_index())
frame = DataFrame(np.arange(8).reshape((2, 4)),index = ['three', 'one'],columns = list('dabc'))
print(frame.sort_index())
print(frame.sort_index(axis = 1))
print(frame.sort_index(axis = 1, ascending = False)) # 降序
print()print('根据值排序')
obj = Series([4, 7, -3, 2])
print(obj.sort_values()) # order已淘汰
print()print('DataFrame指定列排序')
frame = DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
print(frame)
print(frame.sort_values(by = 'b')) # sort_index(by = ...)已淘汰
print(frame.sort_values(by = ['a', 'b']))
print()print('rank,求排名的平均位置(从1开始)')
obj = Series([7, -5, 7, 4, 2, 0, 4])
# 对应排名:-5(1), 0(2), 2(3), 4(4), 4(5), 7(6), 7(7)
print(obj.rank())
print(obj.rank(method = 'first'))  # 取第一次出现,不求平均值。
print(obj.rank(ascending = False, method = 'max')) # 逆序,并取最大值。所以-5的rank是7.
frame = DataFrame({'b':[4.3, 7, -3, 2],'a':[0, 1, 0, 1],'c':[-2, 5, 8, -2.5]})
print(frame)
print(frame.rank(axis = 1)) #按行排序
根据索引排序,对于DataFrame可以指定轴。
a    1
b    2
c    3
d    0
dtype: int32d  a  b  c
one    4  5  6  7
three  0  1  2  3a  b  c  d
three  1  2  3  0
one    5  6  7  4d  c  b  a
three  0  3  2  1
one    4  7  6  5根据值排序
2   -3
3    2
0    4
1    7
dtype: int64DataFrame指定列排序a  b
0  0  4
1  1  7
2  0 -3
3  1  2a  b
2  0 -3
3  1  2
0  0  4
1  1  7a  b
2  0 -3
0  0  4
3  1  2
1  1  7rank,求排名的平均位置(从1开始)
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0

基本功能 带有重复值的索引

import numpy as np
from pandas import Series, DataFrameprint('重复的索引')
obj = Series(range(5), index = ['a', 'a', 'b', 'b', 'c'])
print(obj.index.is_unique) # 判断是非有重复索引
print(obj)
print(obj.a)
df = DataFrame(np.random.randn(4, 3), index = ['a', 'a', 'b', 'b'])
print(df)
print(df.ix['b'].ix[0])
print(df.ix['b'].ix[1])
重复的索引
False
a    0
a    1
b    2
b    3
c    4
dtype: int32
a    0
a    1
dtype: int320         1         2
a  0.986532  0.665410 -1.075305
a -0.060391  1.298336  0.013000
b -1.783654 -0.596269  0.632322
b  0.911897 -0.865876 -0.951782
0   -1.783654
1   -0.596269
2    0.632322
Name: b, dtype: float64
0    0.911897
1   -0.865876
2   -0.951782
Name: b, dtype: float64

汇总和计算描述统计

import numpy as np
from pandas import Series, DataFrameprint('求和')
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],index = ['a', 'b', 'c', 'd'],columns = ['one', 'two'])
print(df)
print(df.sum())  # 按列求和
print(df.sum(axis = 1))  # 按行求和
print()print('平均数')
print(df.mean(axis = 1, skipna = False))
print(df.mean(axis = 1))
print()print('其它')
print(df.idxmax())
print(df.cumsum())
print(df.describe())
obj = Series(['a', 'a', 'b', 'c'] * 4)
print(obj)
print(obj.describe())
求和one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
one    9.25
two   -5.80
dtype: float64
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64平均数
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64其它
one    b
two    d
dtype: objectone  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%         NaN       NaN
50%         NaN       NaN
75%         NaN       NaN
max    7.100000 -1.300000
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
count     16
unique     3
top        a
freq       8
dtype: objectC:\Program Files\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentileRuntimeWarning)

汇总和计算描述统计 相关系数与协方差

链接失效时改一下日期即可

import numpy as np
import pandas_datareader.data as web
# import pandas.io.data as web
from pandas import DataFrameprint('相关性与协方差')  # 协方差:https://zh.wikipedia.org/wiki/%E5%8D%8F%E6%96%B9%E5%B7%AE
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:all_data[ticker] = web.get_data_yahoo(ticker, '4/1/2016', '7/15/2016')price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})volume = DataFrame({tic: data['Volume'] for tic, data in all_data.items()})
returns = price.pct_change()
print(returns.tail())
print(returns.MSFT.corr(returns.IBM))
print(returns.corr())  # 相关性,自己和自己的相关性总是1
print(returns.cov()) # 协方差
print(returns.corrwith(returns.IBM))
相关性与协方差AAPL      GOOG       IBM      MSFT
Date
2016-07-11  0.003103  0.013406  0.005632  0.005545
2016-07-12  0.004537  0.007761  0.011009  0.011789
2016-07-13 -0.005646 -0.005079  0.006241  0.005638
2016-07-14  0.019820  0.005537  0.014302  0.004298
2016-07-15 -0.000101 -0.001526 -0.003120 -0.000744
0.546959627072AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.389326  0.368959  0.389082
GOOG  0.389326  1.000000  0.624841  0.801473
IBM   0.368959  0.624841  1.000000  0.546960
MSFT  0.389082  0.801473  0.546960  1.000000AAPL      GOOG       IBM      MSFT
AAPL  0.000198  0.000068  0.000071  0.000083
GOOG  0.000068  0.000155  0.000107  0.000151
IBM   0.000071  0.000107  0.000188  0.000114
MSFT  0.000083  0.000151  0.000114  0.000229
AAPL    0.368959
GOOG    0.624841
IBM     1.000000
MSFT    0.546960
dtype: float64

汇总和计算描述统计 唯一值以及成员资格

import numpy as np
import pandas as pd
from pandas import Series, DataFrameprint('去重')
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj.unique())
print(obj.value_counts())
print()print('判断元素存在')
mask = obj.isin(['b', 'c'])
print(mask)
print(obj[mask]) #只打印元素b和c
data = DataFrame({'Qu1':[1, 3, 4, 3, 4],'Qu2':[2, 3, 1, 2, 3],'Qu3':[1, 5, 2, 4, 4]})
print(data)
print(data.apply(pd.value_counts).fillna(0))
print(data.apply(pd.value_counts, axis = 1).fillna(0))
去重
['c' 'a' 'd' 'b']
a    3
c    3
b    2
d    1
dtype: int64判断元素存在
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
0    c
5    b
6    b
7    c
8    c
dtype: objectQu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.01    2    3    4    5
0  2.0  1.0  0.0  0.0  0.0
1  0.0  0.0  2.0  0.0  1.0
2  1.0  1.0  0.0  1.0  0.0
3  0.0  1.0  1.0  1.0  0.0
4  0.0  0.0  1.0  2.0  0.0

处理缺失数据

import numpy as np
from pandas import Seriesprint('作为null处理的值')
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
print(string_data.isnull())
string_data[0] = None
print(string_data.isnull())
作为null处理的值
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
0    False
1    False
2     True
3    False
dtype: bool
0     True
1    False
2     True
3    False
dtype: bool

处理缺失数据 滤除缺失数据

import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrameprint('丢弃NA')
data = Series([1, NA, 3.5, NA, 7])
print(data.dropna())
print(data[data.notnull()])
print()print('DataFrame对丢弃NA的处理')
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
print(data.dropna()) # 默认只要某行有NA就全部删除
print(data.dropna(how = 'all'))  # 全部为NA才删除
data[4] = NA  # 新增一列
print(data)
print(data.dropna(axis = 1, how = 'all'))
data = DataFrame(np.random.randn(7, 3))
data.ix[:4, 1] = NA
data.ix[:2, 2] = NA
print(data)
print(data.dropna(thresh = 2)) # 每行至少要有2个非NA元素
丢弃NA
0    1.0
2    3.5
4    7.0
dtype: float64
0    1.0
2    3.5
4    7.0
dtype: float64DataFrame对丢弃NA的处理0    1    2
0  1.0  6.5  3.00    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.00    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.00         1         2
0  1.943456       NaN       NaN
1  0.614118       NaN       NaN
2  0.557657       NaN       NaN
3  1.405687       NaN -0.041758
4 -1.427269       NaN  0.387618
5  0.622559  0.160317  0.532953
6  0.501193 -0.624722  0.9158840         1         2
3  1.405687       NaN -0.041758
4 -1.427269       NaN  0.387618
5  0.622559  0.160317  0.532953
6  0.501193 -0.624722  0.915884

处理缺失数据 填充缺失数据

import numpy as np
from numpy import nan as NA
import pandas as pd
from pandas import Series, DataFrame, Indexprint('填充0')
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA
df.ix[:2, 2] = NA
print(df.fillna(0))
df.fillna(0, inplace = True)
print(df)
print()print('不同行列填充不同的值')
df.ix[:4, 1] = NA
df.ix[:2, 2] = NA
print(df.fillna({1:0.5, 3:-1}))  # 第3列不存在
print()print('不同的填充方式')
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA
df.ix[4:, 2] = NA
print(df)
print(df.fillna(method = 'ffill'))
print(df.fillna(method = 'ffill', limit = 2))
print()print('用统计数据填充')
data = Series([1., NA, 3.5, NA, 7])
print(data.fillna(data.mean()))
填充00         1         2
0 -0.675293  0.000000  0.000000
1 -0.930013  0.000000  0.000000
2 -1.873707  0.000000  0.000000
3 -0.917852  0.000000 -0.978845
4  1.353961  0.000000  0.308181
5  0.133550 -0.566566  0.357234
6  0.987223 -0.633359  0.4080850         1         2
0 -0.675293  0.000000  0.000000
1 -0.930013  0.000000  0.000000
2 -1.873707  0.000000  0.000000
3 -0.917852  0.000000 -0.978845
4  1.353961  0.000000  0.308181
5  0.133550 -0.566566  0.357234
6  0.987223 -0.633359  0.408085不同行列填充不同的值0         1         2
0 -0.675293  0.500000       NaN
1 -0.930013  0.500000       NaN
2 -1.873707  0.500000       NaN
3 -0.917852  0.500000 -0.978845
4  1.353961  0.500000  0.308181
5  0.133550 -0.566566  0.357234
6  0.987223 -0.633359  0.408085不同的填充方式0         1         2
0 -0.193422 -0.716321 -0.607555
1 -1.430887 -1.012584  3.469360
2  1.425113       NaN -0.079921
3 -0.668439       NaN -1.251606
4 -0.032895       NaN       NaN
5  1.316025       NaN       NaN0         1         2
0 -0.193422 -0.716321 -0.607555
1 -1.430887 -1.012584  3.469360
2  1.425113 -1.012584 -0.079921
3 -0.668439 -1.012584 -1.251606
4 -0.032895 -1.012584 -1.251606
5  1.316025 -1.012584 -1.2516060         1         2
0 -0.193422 -0.716321 -0.607555
1 -1.430887 -1.012584  3.469360
2  1.425113 -1.012584 -0.079921
3 -0.668439 -1.012584 -1.251606
4 -0.032895       NaN -1.251606
5  1.316025       NaN -1.251606用统计数据填充
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

层次化索引

import numpy as np
from pandas import Series, DataFrame, MultiIndexprint('Series的层次索引')
data = Series(np.random.randn(10),index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
print(data)
print(data.index)
print(data.b)
print(data['b':'c'])
print(data[:2])
print(data.unstack())
print(data.unstack().stack())
print()print('DataFrame的层次索引')
frame = DataFrame(np.arange(12).reshape((4, 3)),index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
print(frame)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
print(frame)
print(frame.ix['a', 1])
print(frame.ix['a', 2]['Colorado'])
print(frame.ix['a', 2]['Ohio']['Red'])
print()print('直接用MultiIndex创建层次索引结构')
print(MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Gree', 'Red', 'Green']],names = ['state', 'color']))
Series的层次索引
a  1   -0.9261872   -0.3898143   -0.193285
b  1    0.8676882    1.0440333    1.010972
c  1   -1.2688972   -0.659126
d  2    1.5793033    0.167862
dtype: float64
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
1    0.867688
2    1.044033
3    1.010972
dtype: float64
b  1    0.8676882    1.0440333    1.010972
c  1   -1.2688972   -0.659126
dtype: float64
a  1   -0.9261872   -0.389814
dtype: float641         2         3
a -0.926187 -0.389814 -0.193285
b  0.867688  1.044033  1.010972
c -1.268897 -0.659126       NaN
d       NaN  1.579303  0.167862
a  1   -0.9261872   -0.3898143   -0.193285
b  1    0.8676882    1.0440333    1.010972
c  1   -1.2688972   -0.659126
d  2    1.5793033    0.167862
dtype: float64DataFrame的层次索引Ohio     ColoradoGreen Red    Green
a 1     0   1        22     3   4        5
b 1     6   7        82     9  10       11
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        22        3   4        5
b    1        6   7        82        9  10       11
state     color
Ohio      Green    0Red      1
Colorado  Green    2
Name: (a, 1), dtype: int32
color
Green    5
Name: (a, 2), dtype: int32
4直接用MultiIndex创建层次索引结构
MultiIndex(levels=[['Colorado', 'Ohio'], ['Gree', 'Green', 'Red']],labels=[[1, 1, 0], [0, 2, 1]],names=['state', 'color'])

层次化索引 重新分级顺序

import numpy as np
from pandas import Series, DataFrameprint('索引层级交换')
frame = DataFrame(np.arange(12).reshape((4, 3)),index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame_swapped = frame.swaplevel('key1', 'key2')
print(frame_swapped)
print(frame_swapped.swaplevel(0, 1))
print()print('根据索引排序')
print(frame.sortlevel('key2'))
print(frame.swaplevel(0, 1).sortlevel(0))
索引层级交换Ohio     ColoradoGreen Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11Ohio     ColoradoGreen Red    Green
key1 key2
a    1        0   1        22        3   4        5
b    1        6   7        82        9  10       11根据索引排序Ohio     ColoradoGreen Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11Ohio     ColoradoGreen Red    Green
key2 key1
1    a        0   1        2b        6   7        8
2    a        3   4        5b        9  10       11

层次化索引 根据级别汇总统计

import numpy as np
from pandas import DataFrameprint('根据指定的key计算统计信息')
frame = DataFrame(np.arange(12).reshape((4, 3)),index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
print(frame)
print(frame.sum(level = 'key2'))
根据指定的key计算统计信息Ohio     ColoradoGreen Red    Green
key1 key2
a    1        0   1        22        3   4        5
b    1        6   7        82        9  10       11Ohio     ColoradoGreen Red    Green
key2
1        6   8       10
2       12  14       16

层次化索引 使用DataFrame的列

import numpy as np
from pandas import DataFrameprint('使用列生成层次索引')
frame = DataFrame({'a':range(7),'b':range(7, 0, -1),'c':['one', 'one', 'one', 'two', 'two', 'two', 'two'],'d':[0, 1, 2, 0, 1, 2, 3]})
print(frame)
print(frame.set_index(['c', 'd']))  # 把c/d列变成索引
print(frame.set_index(['c', 'd'], drop = False)) # 列依然保留
frame2 = frame.set_index(['c', 'd'])
print(frame2.reset_index()) #reset_index恢复
使用列生成层次索引a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3a  b
c   d
one 0  0  71  1  62  2  5
two 0  3  41  4  32  5  23  6  1a  b    c  d
c   d
one 0  0  7  one  01  1  6  one  12  2  5  one  2
two 0  3  4  two  01  4  3  two  12  5  2  two  23  6  1  two  3c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

其它话题 整数索引

import numpy as np
import sys
from pandas import Series, DataFrameprint('整数索引')
ser = Series(np.arange(3.))
print(ser)
try:print(ser[-1]) # 这里会有歧义
except:print(sys.exc_info()[0])
ser2 = Series(np.arange(3.), index = ['a', 'b', 'c'])
print(ser2[-1])
ser3 = Series(range(3), index = [-5, 1, 3])
print(ser3.iloc[2])  # 避免直接用[2]产生的歧义
print()print('对DataFrame使用整数索引')
frame = DataFrame(np.arange(6).reshape((3, 2)), index = [2, 0, 1])
print(frame)
print(frame.iloc[0])
print(frame.iloc[:, 1])
整数索引
0    0.0
1    1.0
2    2.0
dtype: float64
<class 'KeyError'>
2.0
2对DataFrame使用整数索引0  1
2  0  1
0  2  3
1  4  5
0    0
1    1
Name: 2, dtype: int32
2    1
0    3
1    5
Name: 1, dtype: int32

其它话题 面板(Pannel)数据

items axis:列条款轴
major axis:行主轴
minor axis:行副轴

import numpy as np
import pandas as pd
import pandas_datareader.data as web
from pandas import Series, DataFrame, Index, Panelpdata = Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2016', '1/15/2016')) for stk in ['AAPL', 'GOOG', 'BIDU', 'MSFT']))
print(pdata)
pdata = pdata.swapaxes('items', 'minor')
print(pdata)
print()print("访问顺序:# Item -> Major -> Minor")
print(pdata['Adj Close'])
print(pdata[:, '1/5/2016', :])
print(pdata['Adj Close', '1/6/2016', :])
print()print('Panel与DataFrame相互转换')
stacked = pdata.ix[:, '1/7/2016':, :].to_frame()
print(stacked)
print(stacked.to_panel())
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 10 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2016-01-04 00:00:00 to 2016-01-15 00:00:00
Minor_axis axis: Open to Adj Close
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2016-01-04 00:00:00 to 2016-01-15 00:00:00
Minor_axis axis: AAPL to MSFT访问顺序:# Item -> Major -> MinorAAPL        BIDU        GOOG       MSFT
Date
2016-01-04  103.057063  184.029999  741.840027  53.336436
2016-01-05  100.474523  187.369995  742.580017  53.579760
2016-01-06   98.508268  185.610001  743.619995  52.606467
2016-01-07   94.350769  174.369995  726.390015  50.776676
2016-01-08   94.849671  171.339996  714.469971  50.932406
2016-01-11   96.385500  170.369995  716.030029  50.903205
2016-01-12   97.784376  172.910004  726.070007  51.370385
2016-01-13   95.270312  166.039993  700.559998  50.260832
2016-01-14   97.353950  172.229996  714.719971  51.691573
2016-01-15   95.015969  163.919998  694.450012  49.628194Open        High         Low       Close      Volume   Adj Close
AAPL  105.750000  105.849998  102.410004  102.709999  55791000.0  100.474523
BIDU  186.710007  189.839996  185.339996  187.369995   3367900.0  187.369995
GOOG  746.450012  752.000000  738.640015  742.580017   1950700.0  742.580017
MSFT   54.930000   55.389999   54.540001   55.049999  34079700.0   53.579760
AAPL     98.508268
BIDU    185.610001
GOOG    743.619995
MSFT     52.606467
Name: 2016-01-06 00:00:00, dtype: float64Panel与DataFrame相互转换Open        High         Low       Close      Volume  \
Date       minor
2016-01-07 AAPL    98.680000  100.129997   96.430000   96.449997  81094400.0   BIDU   176.300003  180.979996  171.460007  174.369995   7009100.0   GOOG   730.309998  738.500000  719.059998  726.390015   2963700.0   MSFT    52.700001   53.490002   52.070000   52.169998  56564900.0
2016-01-08 AAPL    98.550003   99.110001   96.760002   96.959999  70798000.0   BIDU   177.589996  178.949997  170.720001  171.339996   4682800.0   GOOG   731.450012  733.229980  713.000000  714.469971   2450900.0   MSFT    52.369999   53.279999   52.150002   52.330002  48754000.0
2016-01-11 AAPL    98.970001   99.059998   97.339996   98.529999  49739400.0   BIDU   172.000000  172.839996  167.710007  170.369995   3202300.0   GOOG   716.609985  718.854980  703.539978  716.030029   2090600.0   MSFT    52.509998   52.849998   51.459999   52.299999  36943800.0
2016-01-12 AAPL   100.550003  100.690002   98.839996   99.959999  49154200.0   BIDU   172.399994  173.490005  169.820007  172.910004   3039600.0   GOOG   721.679993  728.750000  717.317017  726.070007   2024500.0   MSFT    52.759998   53.099998   52.060001   52.779999  36095500.0
2016-01-13 AAPL   100.320000  101.190002   97.300003   97.389999  62439600.0   BIDU   174.250000  175.360001  164.000000  166.039993   3961500.0   GOOG   730.849976  734.739990  698.609985  700.559998   2501700.0   MSFT    53.799999   54.070000   51.299999   51.639999  66883600.0
2016-01-14 AAPL    97.959999  100.480003   95.739998   99.519997  63170100.0   BIDU   166.300003  174.880005  163.009995  172.229996   4082200.0   GOOG   705.380005  721.924988  689.099976  714.719971   2225800.0   MSFT    52.000000   53.419998   51.570000   53.110001  52381900.0
2016-01-15 AAPL    96.199997   97.709999   95.360001   97.129997  79010000.0   BIDU   163.759995  165.309998  158.119995  163.919998   6342800.0   GOOG   692.289978  706.739990  685.369995  694.450012   3592400.0   MSFT    51.310001   51.970001   50.340000   50.990002  70739100.0   Adj Close
Date       minor
2016-01-07 AAPL    94.350769  BIDU   174.369995  GOOG   726.390015  MSFT    50.776676
2016-01-08 AAPL    94.849671  BIDU   171.339996  GOOG   714.469971  MSFT    50.932406
2016-01-11 AAPL    96.385500  BIDU   170.369995  GOOG   716.030029  MSFT    50.903205
2016-01-12 AAPL    97.784376  BIDU   172.910004  GOOG   726.070007  MSFT    51.370385
2016-01-13 AAPL    95.270312  BIDU   166.039993  GOOG   700.559998  MSFT    50.260832
2016-01-14 AAPL    97.353950  BIDU   172.229996  GOOG   714.719971  MSFT    51.691573
2016-01-15 AAPL    95.015969  BIDU   163.919998  GOOG   694.450012  MSFT    49.628194
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 7 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2016-01-07 00:00:00 to 2016-01-15 00:00:00
Minor_axis axis: AAPL to MSFT

例题

08年的一个股票数据

# -*- coding: utf-8 -*-import numpy as np
import pandas as pdnames = ['date',  'time','opening_price',  #开盘价'ceiling_price',  #最高价'floor_price',    #最低价'closing_price',  #收盘价'volume',         #成交量'amount']         #成交金额
raw = pd.read_csv('SH600690.csv', names = names, header = None, index_col='date', parse_dates=True)
print(raw.head())
print()
             time  opening_price  ceiling_price  floor_price  closing_price  \
date
2008-01-02  09:35          22.50          22.63        22.50          22.51
2008-01-02  09:40          22.51          22.51        22.29          22.37
2008-01-02  09:45          22.39          22.62        22.38          22.62
2008-01-02  09:50          22.60          23.00        22.60          22.95
2008-01-02  09:55          22.98          23.20        22.89          23.20   volume      amount
date
2008-01-02   2042.50   4604723.0
2008-01-02   1545.17   3460503.0
2008-01-02   1744.76   3921443.0
2008-01-02   5339.00  12225939.0
2008-01-02  12577.73  28947824.0
# 根据涨跌幅判断数据是否有效
print("根据涨跌幅判断数据是否有效")
def _valid_price(prices):return (((prices.max() - prices.min()) / prices.min()) < 0.223).all()
根据涨跌幅判断数据是否有效
# 按日期分组
print("按日期分组")
days = raw.groupby(level = 0).agg({'opening_price':lambda prices: _valid_price(prices) and prices[0] or 0,'ceiling_price':lambda prices: _valid_price(prices) and np.max(prices) or 0,'floor_price':lambda prices: _valid_price(prices) and np.min(prices) or 0,'closing_price':lambda prices: _valid_price(prices) and prices[-1] or 0,'volume':'sum','amount':'sum'})
print(days.head())
print()
按日期分组floor_price  ceiling_price     volume       amount  closing_price  \
date
2008-01-02        22.29          24.50  200809.34  476179680.0          24.03
2008-01-03        23.81          25.20  166037.98  406906304.0          24.54
2008-01-04        23.68          24.76  149078.64  358418560.0          24.17
2008-01-07        23.75          24.75   93950.43  227289136.0          24.38
2008-01-08        23.49          24.38  149056.24  355752416.0          23.53   opening_price
date
2008-01-02          22.50
2008-01-03          24.03
2008-01-04          24.53
2008-01-07          24.03
2008-01-08          24.38
# 缺少数据处理,因为周末没有交易。
print("缺少数据处理,因为周末没有交易")
start = days.iloc[0:1].index.tolist()[0]
end = days.iloc[-2:-1].index.tolist()[0]
new_idx = pd.date_range(start = start, end = end)
print(new_idx)
data = days.reindex(new_idx)    # 重新索引
zero_values = data.loc[~(data.volume > 0)].loc[:, ['volume', 'amount']]
data.update(zero_values.fillna(0))  # 交易量和金额填0
data.fillna(method = 'ffill', inplace = True)   # 价格用前一天的填充
print(data.head())
print()
缺少数据处理,因为周末没有交易
DatetimeIndex(['2008-01-02', '2008-01-03', '2008-01-04', '2008-01-05','2008-01-06', '2008-01-07', '2008-01-08', '2008-01-09','2008-01-10', '2008-01-11',...'2008-12-21', '2008-12-22', '2008-12-23', '2008-12-24','2008-12-25', '2008-12-26', '2008-12-27', '2008-12-28','2008-12-29', '2008-12-30'],dtype='datetime64[ns]', length=364, freq='D')floor_price  ceiling_price     volume       amount  closing_price  \
2008-01-02        22.29          24.50  200809.34  476179680.0          24.03
2008-01-03        23.81          25.20  166037.98  406906304.0          24.54
2008-01-04        23.68          24.76  149078.64  358418560.0          24.17
2008-01-05        23.68          24.76       0.00          0.0          24.17
2008-01-06        23.68          24.76       0.00          0.0          24.17   opening_price
2008-01-02          22.50
2008-01-03          24.03
2008-01-04          24.53
2008-01-05          24.53
2008-01-06          24.53
# 计算30各自然日里的股票平均波动周率
print("计算30各自然日里的股票平均波动周率")
def gen_item_group_index(total, group_len):group_count = total / group_lengroup_index = np.arange(total)for i in range(int(group_count)):group_index[i * group_len: (i+ 1) * group_len] = igroup_index[(i + 1) * group_len:] = i +1return group_index.tolist()period = 30
group_index = gen_item_group_index(len(data), period)
data['group_index'] = group_index
print(data.head().append(data.tail()))
计算30各自然日里的股票平均波动周率floor_price  ceiling_price     volume       amount  closing_price  \
2008-01-02        22.29          24.50  200809.34  476179680.0          24.03
2008-01-03        23.81          25.20  166037.98  406906304.0          24.54
2008-01-04        23.68          24.76  149078.64  358418560.0          24.17
2008-01-05        23.68          24.76       0.00          0.0          24.17
2008-01-06        23.68          24.76       0.00          0.0          24.17
2008-12-26         8.97           9.23   35093.86   31741640.0           9.08
2008-12-27         8.97           9.23       0.00          0.0           9.08
2008-12-28         8.97           9.23       0.00          0.0           9.08
2008-12-29         8.73           9.15   38576.07   34625144.0           9.11
2008-12-30         8.95           9.14   62983.38   56876600.0           8.96   opening_price  group_index
2008-01-02          22.50            0
2008-01-03          24.03            0
2008-01-04          24.53            0
2008-01-05          24.53            0
2008-01-06          24.53            0
2008-12-26           9.15           11
2008-12-27           9.15           12
2008-12-28           9.15           12
2008-12-29           9.04           12
2008-12-30           9.14           12
# 为负表示先出现最高价再出现最低价,即下跌波动。
print("为负表示先出现最高价再出现最低价,即下跌波动")
def _ceiling_price(prices):return prices.idxmin() < prices.idxmax() and np.max(prices) or (-np.max(prices))group = data.groupby('group_index').agg({'volume': 'sum','floor_price': 'min','ceiling_price': _ceiling_price})
print(group.head())
date_col = pd.DataFrame({'group_index': group_index, 'date': new_idx})
print(date_col)
group['date'] = date_col.groupby('group_index').agg('first')    # 为每个索引添加开始日期
print(group.head())
group['ripples_ratio'] = group.ceiling_price / group.floor_price    # 计算并添加波动率
print(group.head())
print()
为负表示先出现最高价再出现最低价,即下跌波动floor_price  ceiling_price      volume
group_index
0                  20.20         -25.46  3200204.74
1                  18.30          23.23  1310744.12
2                  13.31         -21.40  1749185.77
3                   9.18         -14.60  3978126.81
4                  11.58         -14.85  2642320.68date  group_index
0   2008-01-02            0
1   2008-01-03            0
2   2008-01-04            0
3   2008-01-05            0
4   2008-01-06            0
5   2008-01-07            0
6   2008-01-08            0
7   2008-01-09            0
8   2008-01-10            0
9   2008-01-11            0
10  2008-01-12            0
11  2008-01-13            0
12  2008-01-14            0
13  2008-01-15            0
14  2008-01-16            0
15  2008-01-17            0
16  2008-01-18            0
17  2008-01-19            0
18  2008-01-20            0
19  2008-01-21            0
20  2008-01-22            0
21  2008-01-23            0
22  2008-01-24            0
23  2008-01-25            0
24  2008-01-26            0
25  2008-01-27            0
26  2008-01-28            0
27  2008-01-29            0
28  2008-01-30            0
29  2008-01-31            0
..         ...          ...
334 2008-12-01           11
335 2008-12-02           11
336 2008-12-03           11
337 2008-12-04           11
338 2008-12-05           11
339 2008-12-06           11
340 2008-12-07           11
341 2008-12-08           11
342 2008-12-09           11
343 2008-12-10           11
344 2008-12-11           11
345 2008-12-12           11
346 2008-12-13           11
347 2008-12-14           11
348 2008-12-15           11
349 2008-12-16           11
350 2008-12-17           11
351 2008-12-18           11
352 2008-12-19           11
353 2008-12-20           11
354 2008-12-21           11
355 2008-12-22           11
356 2008-12-23           11
357 2008-12-24           11
358 2008-12-25           11
359 2008-12-26           11
360 2008-12-27           12
361 2008-12-28           12
362 2008-12-29           12
363 2008-12-30           12[364 rows x 2 columns]floor_price  ceiling_price      volume       date
group_index
0                  20.20         -25.46  3200204.74 2008-01-02
1                  18.30          23.23  1310744.12 2008-02-01
2                  13.31         -21.40  1749185.77 2008-03-02
3                   9.18         -14.60  3978126.81 2008-04-01
4                  11.58         -14.85  2642320.68 2008-05-01floor_price  ceiling_price      volume       date  ripples_ratio
group_index
0                  20.20         -25.46  3200204.74 2008-01-02      -1.260396
1                  18.30          23.23  1310744.12 2008-02-01       1.269399
2                  13.31         -21.40  1749185.77 2008-03-02      -1.607814
3                   9.18         -14.60  3978126.81 2008-04-01      -1.590414
4                  11.58         -14.85  2642320.68 2008-05-01      -1.282383
# 波动率排序
print("波动率排序")
ripples = group.sort_values('ripples_ratio', ascending = False)
print(ripples)
print(ripples.head(10).ripples_ratio.mean())
print(ripples.tail(10).ripples_ratio.mean())
print()
波动率排序floor_price  ceiling_price      volume       date  ripples_ratio
group_index
6                   8.55          11.48  2832764.49 2008-06-30       1.342690
10                  7.30           9.52  1405050.39 2008-10-28       1.304110
1                  18.30          23.23  1310744.12 2008-02-01       1.269399
11                  8.83          10.50  2247432.74 2008-11-27       1.189128
12                  8.73          -9.23   101559.45 2008-12-27      -1.057274
0                  20.20         -25.46  3200204.74 2008-01-02      -1.260396
8                   7.70          -9.85  1193866.98 2008-08-29      -1.279221
4                  11.58         -14.85  2642320.68 2008-05-01      -1.282383
9                   6.92          -9.28   748169.86 2008-09-28      -1.341040
7                   8.02         -11.11  1297577.05 2008-07-30      -1.385287
5                   8.02         -12.44  1285826.32 2008-05-31      -1.551122
3                   9.18         -14.60  3978126.81 2008-04-01      -1.590414
2                  13.31         -21.40  1749185.77 2008-03-02      -1.607814
-0.2500274725153588
-1.1165823091576255
# 计算涨跌幅
print("计算涨跌幅")
rise = data.closing_price.diff()
data['rise'] = rise
print(data.head())
计算涨跌幅floor_price  ceiling_price     volume       amount  closing_price  \
2008-01-02        22.29          24.50  200809.34  476179680.0          24.03
2008-01-03        23.81          25.20  166037.98  406906304.0          24.54
2008-01-04        23.68          24.76  149078.64  358418560.0          24.17
2008-01-05        23.68          24.76       0.00          0.0          24.17
2008-01-06        23.68          24.76       0.00          0.0          24.17   opening_price  group_index  rise
2008-01-02          22.50            0   NaN
2008-01-03          24.03            0  0.51
2008-01-04          24.53            0 -0.37
2008-01-05          24.53            0  0.00
2008-01-06          24.53            0  0.00

pandas基本使用方法示例相关推荐

  1. Python小练习2:pandas.Dataframe使用方法示例demo

    pandas.Dataframe使用方法示例demo 本文通过一个实例来介绍pandas.Dataframe的各种常用操作,问题总结并修改自coursera上南京大学的课程:用Python玩转数据. ...

  2. python pandas合并单元格_利用Python pandas对Excel进行合并的方法示例

    前言 在网上找了很多Python处理Excel的方法和代码,都不是很尽人意,所以自己综合网上各位大佬的方法,自己进行了优化,具体的代码如下. 博主也是新手一枚,代码肯定有很多需要优化的地方,欢迎各位大 ...

  3. python 取反_自从用了这招pandas 空数据处理方法,python编程速度提升了不少

    今天为大家带来的内容是:自从用了这招pandas 空数据处理方法,python编程速度提升了不少 文章内容主要介绍了pandas 空数据处理方法详解,文中通过示例代码介绍的非常详细,对大家的学习或者工 ...

  4. Pandas中resample方法详解

    Pandas中resample方法详解 Pandas中的resample,重新采样,是对原样本重新处理的一个方法,是一个对常规时间序列数据重新采样和频率转换的便捷的方法.重新取样时间序列数据. 方便的 ...

  5. 基于python的证件照_20行代码教你用python给证件照换底色的方法示例

    1.图片来源 该图片来源于百度图片,如果侵权,请联系我删除!图片仅用于知识交流. 2.读取图片并显示 imread():读取图片: imshow():展示图片: waitkey():设置窗口等待,如果 ...

  6. java readshort_Java DataInputStream.readShort()方法示例

    Java DataInputStream.readShort()方法示例 DataInputStream.readShort()方法的语法如下所示. public final short readSh ...

  7. pandas使用datetime创建示例dataframe:包含完整的时分秒信息、创建时间索引、使用不同的时间频率、date_range中参数freq的偏移量别名

    pandas使用datetime创建示例dataframe:包含完整的时分秒信息.创建时间索引.使用不同的时间频率.date_range中参数freq的偏移量别名 目录

  8. pandas使用shift方法进行特征差分

    pandas使用shift方法进行特征差分 目录 pandas使用shift方法进行特征差分 #向后偏移一步差分 #向前偏移一步差分

  9. R语言临床诊断试验的ROC分析方法示例

    R语言临床诊断试验的ROC分析方法示例 ROC(receiver operating characteristic curve)接收者操作特征曲线,是由二战中的电子工程师和雷达工程师发明用来侦测战场上 ...

最新文章

  1. 最新县及县以上行政区划代码(截止2010年12月31日)
  2. 【数据挖掘】决策树 分类 ( 抽取分类规则 | 过拟合 | 剪枝 | 先剪 | 后剪 | 连续数值离散化 | 最优化分点 | 增益率选择划分属性 )
  3. C++ Primer 5th笔记(chap 17 标准库特殊设施)输出补白
  4. mysql galera cluster实现vip_2019年学MySQL,最佳的10本新书
  5. HDU 3788 ZOJ问题
  6. iOS逆向之深入解析如何计算+load方法的耗时
  7. Nginx安全说:一剑封喉
  8. .NET Core ResponseCache【缓存篇(一)】
  9. C语言宏定义取得两数的最大值和最小值
  10. leetcode实践:通过链表存储两数之和
  11. Connect to repo.maven.apache.org:443 [repo.maven.apache.org/151.101.24.215] failed: connect timed ou
  12. asp.net ajax 源码,asp.net+jquery+ajax简单留言板 v1.2
  13. C#中扩展StringBuilder支持链式方法
  14. 读书笔记—《雷达信号处理基础》第一章 雷达系统与信号处理概述(2)
  15. 吞食天地2刘蜀传1.8版本图文攻略
  16. 关系数据库中常用的数据结构
  17. 关于通过邮箱找回密码的实现
  18. 美食杰-菜谱大全(二)
  19. 没有IP地址的主机如何保持IP层联通
  20. 雷军北大15分钟演讲:我至少有胆量去想(转)

热门文章

  1. CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Al
  2. 陪你读书之《R语言实战》
  3. 女儿,是父亲上辈子不能在一起的情人
  4. 从李开复换门庭开启的缝隙中窥视
  5. 【概率论】联合多项分布协方差推导——基于条件概率公式
  6. 链表的回文结构(有图易懂)
  7. 2018北京网络赛 HihoCoder - 1835 K-Dimensional Foil II 计算几何 贪心 二分
  8. 文学 2014年2月23日 20:43:43
  9. JavaScript 选择排序
  10. 奇技淫巧-STL 库 ACM算法小技巧(持续更新中~~~)