Python财务分析-Pandas基础

以下代码按顺序执行，在ipython中执行，如果再PyCharm中，按包调用即可，如 Series([1, 2, 3, 4, 5], index=labels)，使用 pd.Series([1, 2, 3, 4, 5], index=labels)，pd为导入pandas时的昵称。

数据分析/创意产生

打开ipython

from IPython.display import IFrame
IFrame(src="http://pandas.pydata.org", width=800, height=350)
----------
<IPython.lib.display.IFrame at 0x222d1e0dc50>

import datetime
import pandas as pd
import pandas_datareader
from pandas import Series, DataFrame
print(pd.__version__)
----------
0.25.1

import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rc('figure', figsize=(8, 7))
mpl.__version__
----------
3.0.2

创建/加载时间序列数据

Python结构
雅虎财经
CSV文件

Python结构

labels = ['a', 'b', 'c', 'd', 'e']
s = Series([1, 2, 3, 4, 5], index=labels)
s
----------
a    1
b    2
c    3
d    4
e    5
dtype: int64

'b' in s
----------
True

s['b']
----------
2

mapping = s.to_dict()
mapping
----------
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Series(mapping)
----------
a    1
b    2
c    3
d    4
e    5
dtype: int64

雅虎财经

import pandas_datareader as pd_data
import datetime
aapl = pd_data.data.get_data_yahoo('AAPL',start=datetime.datetime(2006, 10, 1),end=datetime.datetime(2012, 1, 1))
aapl.head()
----------High        Low       Open      Close     Volume  Adj Close
Date
2006-10-02  10.838572  10.614285  10.728572  10.694285  178159800   9.283074
2006-10-03  10.707143  10.455714  10.635715  10.582857  197677200   9.186352
2006-10-04  10.780000  10.451428  10.585714  10.768572  207270700   9.347562
2006-10-05  10.880000  10.590000  10.647142  10.690000  170970800   9.279355
2006-10-06  10.720000  10.544286  10.631429  10.602858  116739700   9.203714

CSV文件

需要在当前目录下创建这个文件 data/aapl_ohlc.csv

Date,Open,High,Low,Close,Volume,Adj Close
2006-10-02,75.1,75.87,74.3,74.86,25451400,73.29
2006-10-03,74.45,74.95,73.19,74.08,28239600,72.52
2006-10-04,74.1,75.46,73.16,75.38,29610100,73.8
2006-10-05,74.53,76.16,74.13,74.83,24424400,73.26
2006-10-06,74.42,75.04,73.81,74.22,16677100,72.66
2006-10-09,73.8,75.08,73.53,74.63,15650800,73.06
2006-10-10,74.54,74.58,73.08,73.81,18985300,72.26
2006-10-11,73.42,73.98,72.6,73.23,20423400,71.69
2006-10-12,73.61,75.39,73.6,75.26,21173400,73.68

aapl.to_csv('data/aapl_ohlc.csv')
!head data/aapl_ohlc.csvdf = pd.read_csv('data/aapl_ohlc.csv', index_col='Date', parse_dates=True)
df.head()
----------High        Low       Open      Close     Volume  Adj Close
Date
2006-10-02  10.838572  10.614285  10.728572  10.694285  178159800   9.283074
2006-10-03  10.707143  10.455714  10.635715  10.582857  197677200   9.186352
2006-10-04  10.780000  10.451428  10.585714  10.768572  207270700   9.347562
2006-10-05  10.880000  10.590000  10.647142  10.690000  170970800   9.279355
2006-10-06  10.720000  10.544286  10.631429  10.602858  116739700   9.203714

df.index
----------
DatetimeIndex(['2006-10-02', '2006-10-03', '2006-10-04', '2006-10-05','2006-10-06', '2006-10-09', '2006-10-10', '2006-10-11','2006-10-12', '2006-10-13',...'2011-12-16', '2011-12-19', '2011-12-20', '2011-12-21','2011-12-22', '2011-12-23', '2011-12-27', '2011-12-28','2011-12-29', '2011-12-30'],dtype='datetime64[ns]', name='Date', length=1323, freq=None)

data.io的更换

原文中使用的 pandas.io.data 已经被弃用了，目前需要重新安装，本文中的使用方法为： import pandas_datareader as pd_data

官网：https://pandas-datareader.readthedocs.io/en/latest/
安装地址：https://pypi.org/project/pandas-datareader/

v0.8.0 是支持 Python 2.7 的最后一个版本，所以务必安装 v0.8.1 及以上版本.

Series 和 DataFrame ：第一步

索引
切片
创建新列

ts = df['Close'][-10:]
ts
----------
Date
2011-12-16    54.431427
2011-12-19    54.601429
2011-12-20    56.564285
2011-12-21    56.635715
2011-12-22    56.935715
2011-12-23    57.618572
2011-12-27    58.075714
2011-12-28    57.520000
2011-12-29    57.874287
2011-12-30    57.857143
Name: Close, dtype: float64

DataFrame是一系列对象的集合。将DataFrame列切割出来将返回一个系列。

type(ts)
----------
pandas.core.series.Series

date = ts.index[5]
date
----------
Timestamp('2011-12-23 00:00:00')

ts[date]
----------
57.61857223510742

ts[5]
----------
57.61857223510742

我们还可以选择多列。

df[['Open', 'Close']].head()
----------Open      Close
Date
2006-10-02  10.728572  10.694285
2006-10-03  10.635715  10.582857
2006-10-04  10.585714  10.768572
2006-10-05  10.647142  10.690000
2006-10-06  10.631429  10.602858

还可以动态添加新列

df['diff'] = df.Open - df.Close
df.head()
----------High        Low       Open  ...     Volume  Adj Close      diff
Date                                         ...
2006-10-02  10.838572  10.614285  10.728572  ...  178159800   9.283074  0.034286
2006-10-03  10.707143  10.455714  10.635715  ...  197677200   9.186352  0.052857
2006-10-04  10.780000  10.451428  10.585714  ...  207270700   9.347562 -0.182858
2006-10-05  10.880000  10.590000  10.647142  ...  170970800   9.279355 -0.042857
2006-10-06  10.720000  10.544286  10.631429  ...  116739700   9.203714  0.028571
[5 rows x 7 columns]

也能动态删除

del df['diff']
df.head()
----------High        Low       Open      Close     Volume  Adj Close
Date
2006-10-02  10.838572  10.614285  10.728572  10.694285  178159800   9.283074
2006-10-03  10.707143  10.455714  10.635715  10.582857  197677200   9.186352
2006-10-04  10.780000  10.451428  10.585714  10.768572  207270700   9.347562
2006-10-05  10.880000  10.590000  10.647142  10.690000  170970800   9.279355
2006-10-06  10.720000  10.544286  10.631429  10.602858  116739700   9.203714

常见的财务计算

Moving Average 动态均线
Returns 收益

close_px = df['Adj Close']mavg = close_px.rolling(40).mean()
mavg[-10:]
----------
Date
2011-12-16    48.202029
2011-12-19    48.168981
2011-12-20    48.138538
2011-12-21    48.134446
2011-12-22    48.128090
2011-12-23    48.123874
2011-12-27    48.128772
2011-12-28    48.122138

Returns 定义为:

$rt=pt−pt−1pt−1=ptpt−1−1r_t = \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1$

rets = close_px / close_px.shift(1) - 1
rets.head()
----------
Date
2006-10-02         NaN
2006-10-03   -0.010419
2006-10-04    0.017549
2006-10-05   -0.007297
2006-10-06   -0.008152
Name: Adj Close, dtype: float64

或者用 .pct_change() ，得到的结果也是一样的

close_px.pct_change().head()
----------
Date
2006-10-02         NaN
2006-10-03   -0.010419
2006-10-04    0.017549
2006-10-05   -0.007297
2006-10-06   -0.008152
Name: Adj Close, dtype: float64

绘图基础

Series 和 DataFrames 有一个关联的 .plot() 命令（在幕后使用Matplotlib ）

import matplotlib.pyplot as plt
close_px.plot(label='AAPL')
mavg.plot(label='mavg')
plt.legend()
----------
<matplotlib.legend.Legend at 0x182f3a8cac8>

plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-8Q1KkfEk-1586415057997)(D:\workdata\2020\机器学习与深度学习\image\image-20200402162523592.png)]

可选：如果有时间的话…

import pandas as pd
import pandas_datareader as pdd
import datetimedf = pdd.data.get_data_yahoo(['AAPL', 'GE', 'GOOG', 'IBM', 'KO', 'MSFT', 'PEP'],start=datetime.datetime(2010, 1, 1),end=datetime.datetime(2013, 1, 1))['Adj Close']
df.head()
----------
Symbols          AAPL         GE        GOOG  ...         KO       MSFT        PEP
Date                                          ...
2009-12-31  26.131752  10.526512  308.832428  ...  19.278732  23.925440  44.622261
2010-01-04  26.538483  10.749147  312.204773  ...  19.292267  24.294369  44.945187
2010-01-05  26.584366  10.804806  310.829926  ...  19.058893  24.302216  45.488274
2010-01-06  26.161509  10.749147  302.994293  ...  19.052124  24.153070  45.033253
2010-01-07  26.113146  11.305734  295.940735  ...  19.004770  23.901886  44.747028
[5 rows x 7 columns]

import matplotlib.pyplot as pltrets = df.pct_change()plt.scatter(rets.PEP, rets.KO)
plt.xlabel('Returns PEP')
plt.ylabel('Returns KO')
----------
Text(0, 0.5, 'Returns KO')

plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-u8DgxK7P-1586415057998)(D:\workdata\2020\机器学习与深度学习\image\image-20200402173026615.png)]

pandas的矩阵实际上要好一些，但我这里给出的是一个数组，如果要展示的话，还是用plt.show() 就行了

from pandas.plotting import scatter_matrix
scatter_matrix(rets, diagonal='kde', figsize=(10, 10))
----------
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E6FFEE55C0>,<matplotlib.axes._subplots.AxesSubplot object at 0x000001E6FFF3C7F0>,...<matplotlib.axes._subplots.AxesSubplot object at 0x000001E6816D85F8>,<matplotlib.axes._subplots.AxesSubplot object at 0x000001E681701B70>]],dtype=object)

plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-TvkjCnds-1586415057999)(D:\workdata\2020\机器学习与深度学习\image\image-20200402174915997.png)]

corr = rets.corr()
corr
----------
Symbols      AAPL        GE      GOOG       IBM        KO      MSFT       PEP
Symbols
AAPL     1.000000  0.457677  0.533692  0.502091  0.389567  0.473377  0.308979
GE       0.457677  1.000000  0.463039  0.609821  0.594637  0.600791  0.534035
GOOG     0.533692  0.463039  1.000000  0.471161  0.401712  0.473264  0.330564
IBM      0.502091  0.609821  0.471161  1.000000  0.600568  0.646600  0.490903
KO       0.389567  0.594637  0.401712  0.600568  1.000000  0.530160  0.635659
MSFT     0.473377  0.600791  0.473264  0.646600  0.530160  1.000000  0.477452
PEP      0.308979  0.534035  0.330564  0.490903  0.635659  0.477452  1.000000

plt.imshow(corr, cmap='hot', interpolation='none')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns);
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-K3AlvvFS-1586415058000)(D:\workdata\2020\机器学习与深度学习\image\image-20200402175002817.png)]

我们经常感兴趣的一件事是预期回报与我们承担的风险之间的关系。这两者之间往往存在一种取舍。这里用 plt.annotate 在散布图上做标记。

plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Risk')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):plt.annotate(label,xy=(x, y), xytext=(20, -20),textcoords='offset points', ha='right', va='bottom',bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
plt.show()
----------

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-TLxaHj7t-1586415058000)(D:\workdata\2020\机器学习与深度学习\image\image-20200402175925739.png)]

数据对齐和处理丢失的值

Join, merge, concatenate
填充 NaNs

让我们从雅虎财经获取更多的证券，并建立一个调整后的收盘价的 DataFrame。

series_list = []
securities = ['AAPL', 'GOOG', 'IBM', 'MSFT']
for security in securities:s = pd.io.data.get_data_yahoo(security, start=datetime.datetime(2011, 10, 1), end=datetime.datetime(2013, 1, 1))['Adj Close']s.name = security # Rename series to match security nameseries_list.append(s)

pd.concat 可以把多个 Series 连接到一个 DataFrame 中。

df = pd.concat(series_list, axis=1)
df.head()
----------AAPL        GOOG         IBM       MSFT
Date
2011-09-30  47.285904  256.558350  130.822800  20.321293
2011-10-03  46.452591  246.834808  129.640747  20.027370
2011-10-04  46.192177  250.012894  130.725555  20.688694
2011-10-05  46.905209  251.407669  132.304108  21.137737
2011-10-06  46.796078  256.393982  135.924973  21.505136

import numpy as np
df.ix[0, 'AAPL'] = np.nan
df.ix[1, ['GOOG', 'IBM']] = np.nan
df.ix[[1, 2, 3], 'MSFT'] = np.nan
df.head()
----------AAPL        GOOG         IBM       MSFT
Date
2011-09-30        NaN  256.558350  130.822800  20.321293
2011-10-03  46.452591         NaN         NaN        NaN
2011-10-04  46.192177  250.012894  130.725555        NaN
2011-10-05  46.905209  251.407669  132.304108        NaN
2011-10-06  46.796078  256.393982  135.924973  21.505136

panda对缺少值的计算提供了强大的支持。

(df.AAPL + df.GOOG).head()
----------
Date
2011-09-30           NaN
2011-10-03           NaN
2011-10-04    296.205070
2011-10-05    298.312878
2011-10-06    303.190060
dtype: float64

一种常用的时间序列缺失值的估算方法是正向填充法。

df.ffill().head()
----------AAPL        GOOG         IBM       MSFT
Date
2011-09-30        NaN  256.558350  130.822800  20.321293
2011-10-03  46.452591  256.558350  130.822800  20.321293
2011-10-04  46.192177  250.012894  130.725555  20.321293
2011-10-05  46.905209  251.407669  132.304108  20.321293
2011-10-06  46.796078  256.393982  135.924973  21.505136