Predicting the Market Index with a Decision Tree

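After a friendly discussion with fellow forum members, I now agree that time ordering is information you cannot ignore when predicting the market index. Accuracy obtained from naive (shuffled) cross-validation, and any parameter tuning based on that accuracy, overfits easily.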
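To illustrate why shuffled cross-validation can flatter a model on time-ordered data, here is a minimal sketch on synthetic data (not the index data used in this post). It assumes a current Python 3 scikit-learn, where the splitters live in `sklearn.model_selection`; the random-walk series and the `make_dataset` helper are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def make_dataset(n=600, horizon=30, seed=0):
    """Synthetic random-walk 'index'; label = is the level higher `horizon` steps later?"""
    rng = np.random.RandomState(seed)
    level = 1000 + np.cumsum(rng.normal(0, 10, n + horizon))
    X = np.column_stack([level[:n], np.arange(n)])   # level and a time counter
    y = (level[horizon:] > level[:n]).astype(int)
    return X, y

X, y = make_dataset()
tree = DecisionTreeClassifier(random_state=0)

# Shuffled folds: neighbouring, nearly identical days land in both train and test.
shuffled = cross_val_score(tree, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Chronological folds: each test fold lies strictly after its training data.
chrono = cross_val_score(tree, X, y, cv=TimeSeriesSplit(n_splits=5))

print("shuffled CV accuracy:      %.2f" % shuffled.mean())
print("chronological CV accuracy: %.2f" % chrono.mean())
```

On autocorrelated data like this, the shuffled score is typically the more optimistic of the two, which is the effect described above.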
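I also noticed a problem earlier: if the label is simply whether the index is higher or lower 30 days later, the model easily fits noise. So I set a 10% threshold: the index counts as rising only when the close 30 trading days later is more than 10% above today's close, and as falling only when it is more than 10% below; samples in between are dropped. A better approach would be to use the mean over the next 30 days as the basis for the label, but I think the threshold is good enough to capture the market's direction.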
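The two labeling ideas can be sketched as follows. This is a minimal illustration on a tiny made-up price series, with a 2-row horizon so the result can be checked by hand; `label_by_threshold` and `label_by_future_mean` are hypothetical helper names, not functions used elsewhere in this post.

```python
import numpy as np
import pandas as pd

def label_by_threshold(close, horizon=30, threshold=0.10):
    """'up'/'down' if the close `horizon` rows later moved by more than `threshold`;
    smaller moves (or rows without future data) stay NaN and can be dropped."""
    future_return = close.shift(-horizon) / close - 1.0
    labels = pd.Series(np.nan, index=close.index, dtype=object)
    labels[future_return > threshold] = 'up'
    labels[future_return < -threshold] = 'down'
    return labels

def label_by_future_mean(close, horizon=30):
    """Alternative: compare today's close with the mean close of the next `horizon` rows."""
    # rolling mean covers rows t-horizon+1..t; shifting by -horizon makes it cover t+1..t+horizon
    future_mean = close.rolling(horizon).mean().shift(-horizon)
    labels = pd.Series(np.nan, index=close.index, dtype=object)
    labels[future_mean > close] = 'up'
    labels[future_mean < close] = 'down'
    return labels

close = pd.Series([100.0, 100.0, 120.0, 105.0, 85.0, 100.0])
print(label_by_threshold(close, horizon=2).tolist())
print(label_by_future_mean(close, horizon=2).tolist())
```

The threshold version discards every sample with a small move, so it trades sample size for cleaner labels; the future-mean version keeps almost all samples but smooths over the path in between.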
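So I stayed up late reworking the code and found that the decision tree's predictions of the index are not robust at all. The earlier post's claim that the decision tree was relatively effective must have been the result of overfitting... hmm.
One more point: the decision tree assigns some factors an importance of 0. It may also be that the data we selected simply does not contain enough information to predict the index's moves.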

import numpy as np
import pandas as pd
from CAL.PyCAL import Date
from CAL.PyCAL import Calendar
from CAL.PyCAL import BizDayConvention
from sklearn.tree import DecisionTreeClassifier
import seaborn as sns
import talib

Data wrangling again... what a headache.

def max_arr(arr):
    # 30-row rolling maximum; columns talib cannot process become all-NaN
    try:
        se = talib.MAX(np.array(arr), 30)
    except Exception:
        se = np.array([np.nan] * len(arr))
    return se

def min_arr(arr):
    # 30-row rolling minimum; columns talib cannot process become all-NaN
    try:
        se = talib.MIN(np.array(arr), 30)
    except Exception:
        se = np.array([np.nan] * len(arr))
    return se

fields = ['tradeDate', 'closeIndex', 'highestIndex', 'lowestIndex', 'turnoverVol', 'CHG', 'CHGPct']
stock = '000300'

# tradeDate: trading day; closeIndex: closing level; highestIndex / lowestIndex: intraday high / low; CHG: change
index_raw = DataAPI.MktIdxdGet(ticker=stock, beginDate=u"2006-03-01", endDate=u"2014-12-01", field=fields, pandas="1")
index_date = index_raw.set_index('tradeDate')
index_date = index_date.dropna()

index_date['max_difference'] = index_date['highestIndex'] - index_date['lowestIndex']
index_date['max_of_30day'] = index_date.apply(max_arr)['highestIndex']
index_date['min_of_30day'] = index_date.apply(min_arr)['lowestIndex']
index_date['max_difference_of_30day'] = index_date.apply(max_arr)['max_difference']

# closing level 30 rows (trading days) ahead
index_date['closeIndex_after30days'] = index_date['closeIndex'].shift(-30)
index_date = index_date.dropna()   # drop the incomplete rows at the start and the end

# lables_raw = index_date['closeIndex_after30days']  # the raw value to be predicted
# lables = (index_date['closeIndex_after30days'] > index_date['closeIndex']*1.1)  # classification target: is the close 30 days later above today's close
# index_date['closeIndex_after30days'] = index_date['closeIndex_after30days'] - index_date['closeIndex']

# keep only moves of more than 10% in either direction and label them
up = index_date['closeIndex_after30days'] > index_date['closeIndex'] * 1.1
down = index_date['closeIndex_after30days'] < index_date['closeIndex'] * 0.9
keep = up | down
index_date = index_date[keep].copy()
index_date['closeIndex_after30days'] = np.where(up[keep], 'up', 'down')

lables = index_date['closeIndex_after30days']
features = index_date.drop(['closeIndex_after30days'], axis=1)  # remove the prediction target from the features
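The 30-day rolling features can also be computed with pandas alone, which avoids wrapping talib in a try/except. Below is a minimal sketch on a made-up frame: the column names are borrowed from the code above, the values are random, and the result should match the talib version only under the assumption that both use a trailing 30-row window.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 100
low = 3000 + rng.normal(0, 50, n).cumsum()
frame = pd.DataFrame({
    'lowestIndex': low,
    'highestIndex': low + rng.uniform(5, 60, n),   # high is always above low
})
frame['max_difference'] = frame['highestIndex'] - frame['lowestIndex']

window = 30
frame['max_of_30day'] = frame['highestIndex'].rolling(window).max()
frame['min_of_30day'] = frame['lowestIndex'].rolling(window).min()
frame['max_difference_of_30day'] = frame['max_difference'].rolling(window).max()

# The first window-1 rows have no complete window and come out as NaN.
print(frame[['max_of_30day', 'min_of_30day', 'max_difference_of_30day']].iloc[27:32])
```

Working column by column also makes it explicit which inputs feed which feature, instead of applying the helper to every column and picking one result.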

from sklearn import cross_validation
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier

# chronological split: first 5/6 of the rows for fitting, last 1/6 held out as out-of-sample data
sp = int(len(features)*5./6.)

features_1 = features[:sp]
features_2 = features[sp:]
lables_1 = lables[:sp]
lables_2 = lables[sp:]

scaler1 = preprocessing.StandardScaler().fit(features_1)
features_scaler_1 = scaler1.transform(features_1)
scaler2 = preprocessing.StandardScaler().fit(features_2)
features_scaler_2 = scaler2.transform(features_2)
# the four lines above standardize the data

X_train, X_test, y_train, y_test = cross_validation.train_test_split(features_scaler_1, lables_1, test_size=0.2, random_state=0)
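A caveat on the standardization above: `scaler2` is fitted on the out-of-sample rows themselves, so the two sets end up on different scales. A common alternative is to fit the scaler on the training rows only and reuse it. The sketch below compares the two on synthetic data with a deliberately shifted later period; it assumes a current scikit-learn, and all names and data are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_old = rng.normal(0.0, 1.0, size=(400, 3))
y_old = (X_old[:, 0] > 0.5).astype(int)
# The later period has drifted upwards, as index levels do over the years.
X_new = rng.normal(1.0, 1.0, size=(200, 3))
y_new = (X_new[:, 0] > 0.5).astype(int)

train_scaler = StandardScaler().fit(X_old)
tree = DecisionTreeClassifier(random_state=0).fit(train_scaler.transform(X_old), y_old)

# (a) reuse the training scaler on the later data
acc_shared = tree.score(train_scaler.transform(X_new), y_new)
# (b) fit a second scaler on the later data, as in the code above
acc_separate = tree.score(StandardScaler().fit_transform(X_new), y_new)

print("shared scaler:   %.2f" % acc_shared)
print("separate scaler: %.2f" % acc_separate)
```

With a separate scaler the tree's split thresholds no longer correspond to the same raw values, so part of any out-of-sample drop may come from the preprocessing rather than from the model.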

Before any tuning, we first measure the accuracy: 0.97 in-sample and 0.61 out-of-sample.

clf_tree = DecisionTreeClassifier(random_state=0)
clf_tree.fit(X_train, y_train)
print("In-sample prediction accuracy: %0.2f" % (clf_tree.score(X_test, y_test)))
print("Out-of-sample prediction accuracy: %0.2f" % (clf_tree.score(features_scaler_2, lables_2)))

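Next comes parameter tuning (a decision tree has no C value; the first parameter here is max_depth). I let max_depth run over range(1, 100) and plot the scores.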

i_list = []
score_list1 = []
score_list2 = []
for i in range(1, 100, 1):
    clf_tree = DecisionTreeClassifier(max_depth=i)   # decision tree
    clf_tree.fit(X_train, y_train)
    i_list.append(i)
    score_list1.append(clf_tree.score(X_test, y_test))
    score_list2.append(clf_tree.score(features_scaler_2, lables_2))
score_list_df = pd.DataFrame({'max_depth': i_list, 'in_sets': score_list1, 'out_of_sets': score_list2})
score_list_df.plot(x='max_depth',title='score change with max_depth')
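The same kind of sweep can be written with `validation_curve`, which also makes it easy to score on chronological folds instead of a random hold-out. This is a sketch on synthetic data, assuming a current scikit-learn (`sklearn.model_selection`); the data and the choice of 5 folds are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, validation_curve
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))
# Label depends on two features plus noise, so very deep trees can overfit.
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 500)) > 0).astype(int)

depths = np.arange(1, 21)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=depths,
    cv=TimeSeriesSplit(n_splits=5),   # each validation fold follows its training fold
)

mean_train = train_scores.mean(axis=1)   # one value per depth
mean_test = test_scores.mean(axis=1)
best_depth = depths[mean_test.argmax()]
print("best max_depth by chronological CV:", best_depth)
```

Plotting `mean_train` and `mean_test` against `depths` gives the same picture as the chart above, with the gap between the two curves showing where extra depth only memorizes the training rows.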

Then min_samples_leaf, in the same way. In the code below the value sweeps from 0.2 to 39.8 in steps of 0.2.

i_list = []
score_list1 = []
score_list2 = []
for i in range(1, 200, 1):
    i = i/5.
    clf_tree = DecisionTreeClassifier(min_samples_leaf=i)
    clf_tree.fit(X_train, y_train)
    i_list.append(i)
    score_list1.append(clf_tree.score(X_test, y_test))
    score_list2.append(clf_tree.score(features_scaler_2, lables_2))
score_list_df = pd.DataFrame({'min_samples_leaf': i_list, 'in_sets': score_list1, 'out_of_sets': score_list2})
score_list_df.plot(x='min_samples_leaf',title='score change with min_samples_leaf')

Then min_samples_split, swept over 2 to 99.

i_list = []
score_list1 = []
score_list2 = []
min_samples  =  range(2,100,1)
for i in min_samples:
    clf_tree = DecisionTreeClassifier(min_samples_split=i)
    clf_tree.fit(X_train, y_train)
    i_list.append(i)
    score_list1.append(clf_tree.score(X_test, y_test))
    score_list2.append(clf_tree.score(features_scaler_2, lables_2))
score_list_df = pd.DataFrame({'min_samples_split': i_list, 'in_sets': score_list1, 'out_of_sets': score_list2})
score_list_df.plot(x='min_samples_split',title='score change with min_samples_split')

Then min_weight_fraction_leaf... see the plot for yourself. Tuning this parameter turns out to have little effect on the model.

i_list = []
score_list1 = []
score_list2 = []
min_weight  =  [x /1000. for x in range(1,500,1)]
for i in min_weight:
    clf_tree = DecisionTreeClassifier(min_weight_fraction_leaf=i)
    clf_tree.fit(X_train, y_train)
    i_list.append(i)
    score_list1.append(clf_tree.score(X_test, y_test))
    score_list2.append(clf_tree.score(features_scaler_2, lables_2))
score_list_df = pd.DataFrame({'min_weight_fraction_leaf': i_list, 'in_sets': score_list1, 'out_of_sets': score_list2})
score_list_df.plot(x='min_weight_fraction_leaf',title='score change with min_weight')

Now that we know the rough range of good parameter values, we use GridSearchCV to find the best combination within that range.

from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import ShuffleSplit
params = {'max_depth': range(1,10,1), 'min_samples_split': range(2,10,1), 'min_samples_leaf': range(1,20,1)}
clf_tree = DecisionTreeClassifier()
grid = GridSearchCV(clf_tree, params)
grid = grid.fit(X_train, y_train)
print(grid.best_estimator_)
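In current scikit-learn releases, `sklearn.grid_search` and `sklearn.cross_validation` have been replaced by `sklearn.model_selection`. The sketch below shows the same kind of search there, with chronological folds passed through `cv` so that the search itself respects time order. The data is synthetic and the parameter grid is a shortened, illustrative version of the one above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.normal(size=(400, 4))
y = ((X[:, 0] - X[:, 2] + rng.normal(0, 0.5, 400)) > 0).astype(int)

params = {
    'max_depth': list(range(1, 10)),
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10, 20],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    params,
    cv=TimeSeriesSplit(n_splits=4),   # every validation fold comes after its training fold
)
grid.fit(X, y)
print(grid.best_params_)
print("mean CV accuracy of best setting: %.2f" % grid.best_score_)
```

Without an explicit `cv`, the search falls back to its default folds, which do not guarantee that validation rows come after the training rows.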

Then we recompute the accuracy using the best parameters found.

clf_tree = DecisionTreeClassifier(random_state=0,max_depth = 9 , min_samples_split=2 )
clf_tree.fit(X_train, y_train)
print("In-sample prediction accuracy: %0.2f" % (clf_tree.score(X_test, y_test)))
print("Out-of-sample prediction accuracy: %0.2f" % (clf_tree.score(features_scaler_2, lables_2)))

i_list = []
score_list1 = []
score_list2 = []
for i in xrange(200):
    num_now = len(features) + i - 300
    features_1 = features[:num_now]
    features_2 = features[num_now:num_now+100]
    lables_1 = lables[:num_now]
    lables_2 = lables[num_now:num_now+100]
    scaler1 = preprocessing.StandardScaler().fit(features_1)
    features_scaler_1 = scaler1.transform(features_1)
    scaler2 = preprocessing.StandardScaler().fit(features_2)
    features_scaler_2 = scaler2.transform(features_2)
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(features_scaler_1, lables_1, test_size=0.2, random_state=0)
    clf_tree = DecisionTreeClassifier(random_state=0, max_depth=9, min_samples_split=2)
    clf_tree.fit(X_train, y_train)
    i_list.append(features_2.index[0])
    score_list1.append(clf_tree.score(X_test, y_test))
    score_list2.append(clf_tree.score(features_scaler_2, lables_2))
score_list_df =
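The loop above is a walk-forward check: train on everything up to a cut-off, score on the following 100 rows, then move the cut-off forward by one row. A compact sketch of that pattern on synthetic data follows. `walk_forward_scores` is a hypothetical helper written for this illustration, not the code from this post; it skips the scaling step and assumes a current Python 3 scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def walk_forward_scores(X, y, first_cut, n_steps, test_size=100, step=1, **tree_params):
    """Train on rows [0, cut), score on rows [cut, cut + test_size), then advance cut."""
    rows = []
    for k in range(n_steps):
        cut = first_cut + k * step
        if cut + test_size > len(X):
            break   # not enough rows left for a full test window
        model = DecisionTreeClassifier(random_state=0, **tree_params)
        model.fit(X[:cut], y[:cut])
        score = model.score(X[cut:cut + test_size], y[cut:cut + test_size])
        rows.append({'cut': cut, 'out_of_sets': score})
    return pd.DataFrame(rows)

rng = np.random.RandomState(0)
X = rng.normal(size=(600, 3))
y = np.where(X[:, 0] + rng.normal(0, 0.5, 600) > 0, 'up', 'down')

scores = walk_forward_scores(X, y, first_cut=300, n_steps=200, max_depth=9, min_samples_split=2)
print(scores['out_of_sets'].describe())
```

The spread of `out_of_sets` across cut-offs, rather than any single value, is what speaks to robustness: a model whose score swings widely from one window to the next is not one to rely on.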

Next, we look at the feature importances returned by the model to see which features matter most to our decision tree.

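The table (and histogram) below show that the most influential features are the 30-day minimum, the 30-day maximum and the largest daily range over 30 days, followed by the day's trading volume. Put simply, the range of movement and the largest swing over the previous 30 days reflect, to some extent, the direction of the market 30 days later.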
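For reference, importances of this kind can be read from a fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data in which only the first two columns drive the label; the column names are borrowed from this post purely as labels, and the numbers it prints say nothing about the real index.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
names = ['min_of_30day', 'max_of_30day', 'turnoverVol', 'noise']
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=names)
# Synthetic label driven by the first two columns only.
y = np.where(X['min_of_30day'] + 0.5 * X['max_of_30day'] > 0, 'up', 'down')

tree = DecisionTreeClassifier(random_state=0, max_depth=9).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=names).sort_values(ascending=False)
print(importance)
```

The importances sum to 1, and a feature the tree never splits on gets exactly 0, which is how the zero-importance factors mentioned at the top arise.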
