R语言模拟：Cross Validation

作者:量化小白一枚，上财研究生在读，偏向数据分析与量化投资

个人公众号:量化小白上分记

前两篇R语言模拟：Bias Variance Trade-Off与R语言模拟:Bias Variance Decomposition在理论推导和模拟的基础上，对于误差分析中的偏差方差进行了分析。本文在前文的基础上，分析一种常用的估计预测误差进而可以参数优化的方法：交叉验证，并通过R语言进行模拟。

K-FOLD CV

交叉验证是数据建模中一种常用方法，通过交叉验证估计预测误差并有效避免过拟合现象。简要说明CV(CROSS VALIDATION)的逻辑，最常用的是K-FOLD CV，以K = 5为例。

将整个样本集分为K份，每次取其中一份作为Validation Set，剩余四份为Trainset，用Trainset训练模型，然后计算模型在Validation set上的误差，循环k次得到k个误差后求平均，作为预测误差的估计量。

除此之外，比较常用的还有LOOCV，每次只留出一项做Validation，剩余项做Trainset。

参数优化

对于含有参数的模型，可以分析模型在不同参数值下的CV的误差，选取误差最小的参数值。

误区

ESL 7.10.2中提到应用CV的两种方法，比如对于一个包含多个自变量的分类模型，建模中包括两方面，一个是筛选出预测能力强的变量，一个是估计最佳的参数。因此有两种使用CV的方法（以下内容摘自ESL 7.10.2）

1.Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels

2.Using just this subset of predictors, build a multivariate classifier.

3.Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

1. Divide the samples into K CV folds(group) at random.

2. For each fold k = 1,2,...,K

(a) Find a subset of "good" predictors that show fairly strong correlation with the class labels, using all of the samples except those in fold k.

(b) Using just this subset of predictors,build a multivariate classifier,using all of the samples except those in fold K.

(c) Use the classifier to predict rhe class labels for the samples in fold k.

简单来说，第一种方法是先使用全样本筛出预测能力强的变量，仅使用这部分变量进行建模，然后用这部分变量建立的模型通过CV优化参数；第二种方法是对全样本CV后，在CV过程中进行筛选变量，然后用筛选出来的变量优化参数，这样CV中每个循环里得到的预测能力强的变量有可能是不一样的。

我们经常使用的是第一种方法，但事实上第一种方法是错误的，直接通过全样本得到的预测能力强的变量，再进行CV，计算出来的误差一定是偏低的。因为筛选过程使用了全样本，再使用CV，用其中K-1份预测1份，使用的预测能力强的变量中实际上包含了这一份的信息，因此存在前视误差。

（The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.）

作者给出一个可以验证的例子：

考虑包含50个样本的二分类集合，两种类别的样本数相同，5000个服从标准正态分布的自变量且均与类别标签独立。这种情况下，理论上使用任何模型得到的误差都应该是50%。

如果此时我们使用上述方法1找出100个与类别标签相关性最强的变量，然后仅对这100个变量使用KNN算法，并令K=1,CV得到的误差仅有3%，远远低于真实误差50%。

作者使用了5-FOLD CV并且计算了CV中每次Validation set 中10个样本的自变量与类别的相关系数，发现此时相关系数平均值为0.28，远大于0。

而使用第二种方法计算的相关系数远低于第一种方法。

模拟

我们通过R语言模拟给出一个通过CV估计最优参数的例子，例子为上一篇右下图的延伸。

样本： 80个样本，20个自变量，自变量均服从[0,1]均匀分布，因变量定义为：

Y = ifelse(X1+X2+...+X10>5,1,0)

使用Best Subset Regression建模（就是在线性回归的基础上加一步筛选最优自变量组合），模型唯一待定参数为Subset Size p，即最优的自变量个数。

通过10-Fold CV计算不同参数p下的预测误差并给出最优的p值，Best Subset Regression可以通过函数regsubsets实现，最终结果如下：

对比教材中的结果

其中，红色线为真实的预测误差，蓝色线为10-FOLD CV计算出的误差，bar为1标准误。可以得出以下结论：

CV估计的误差与实际预测误差的数值大小、趋势变化基本相同，并且真实预测误差基本都在CV1标准误以内。
p = 10附近误差最小，表明参数p应设定为10，这也与样本因变量定义相符。

可以直接运行的R代码

setwd('xxxx')library(leaps) library(DAAG)library(caret)lm.BestSubSet<- function(trainset,k){ lm.sub <- regsubsets(V21~.,trainset,nbest =1,nvmax = 20) summary(lm.sub) coef_lm <- coef(lm.sub,k) strings_coef_lm <- coef_lm x <- paste(names(coef_lm)[2:length(coef_lm)], collapse ='+') formulas <- as.formula(paste('V21~',x,collapse='')) return(formulas)}

# ====================   get error ===============================getError <- function(k,num,modeltype,seeds,n_test){ set.seed(seeds) testset <- as.data.frame(matrix(runif(n_test*21,0,1),n_test))

 Allfx_hat <- matrix(0,n_test,num) Ally <- matrix(0,n_test,num) Allfx <- matrix(0,n_test,num)

 # 模拟 num次    for (i in 1:num){   trainset <- as.data.frame(matrix(runif(80*21,0,1),80))   fx_train <- ifelse(trainset[,1] +trainset[,2] +trainset[,3] +trainset[,4] +trainset[,5]+                        trainset[,6] +trainset[,7] +trainset[,8] +trainset[,9] +trainset[,10]>5,1,0)      trainset[,21] <- fx_train      fx_test <- ifelse(testset[,1] +testset[,2] +testset[,3] +testset[,4] +testset[,5]+                       testset[,6] +testset[,7] +testset[,8] +testset[,9] +testset[,10]>5,1,0)

   testset[,21] <- fx_test 

   # best subset   lm.sub <- lm(formula = lm.BestSubSet(trainset,k),trainset)   probs <- predict(lm.sub,testset[,1:20], type = 'response')

   Allfx_hat[,i] <- probs   Ally[,i] <- testset[,21]   Allfx[,i] <- fx_test

 }  # 计算方差、偏差等

 # irreducible <- sigma^2

 irreducible  <- mean(apply( Allfx - Ally  ,1,var)) SquareBais  <- mean(apply((Allfx_hat - Allfx)^2,1,mean)) Variance <- mean(apply(Allfx_hat,1,var))

 # 回归或分类两种情况 if (modeltype == 'reg'){   PredictError <- irreducible + SquareBais + Variance  }else{   PredictError <- mean(ifelse(Allfx_hat>=0.5,1,0)!=Allfx) } result <- data.frame(k,irreducible,SquareBais,Variance,PredictError) return(result)}

# -------------------- classification -------------------modeltype <- 'classification'num <- 100n_test <- 1000seeds <- 4

all_p <- seq(2,20,1)result <- getError(1,num,modeltype,seeds,n_test)for (i in all_p){ result <- rbind(result,getError(i,num,modeltype,seeds,n_test)) }

# ==================== CV  =========================

fun_cv <- function(trainset,kfold=10,seed_num =3,p){ set.seed(seed_num) folds<-createFolds(y=trainset$V21,kfold)  misrate <- NULL

 for (i in(1:kfold)){

   train_cv<-trainset[-folds[[i]],]   test_cv<-trainset[folds[[i]],]

   model <- lm(formula = lm.BestSubSet(train_cv,p),train_cv)   result  <- predict(model,type='response',test_cv)

   # result    misrate[i] <- mean( ifelse(result>=0.5,1,0) != test_cv$V21) } sderror <- sd(misrate)/(length(misrate)^0.5) misrate <- mean(misrate) result <- data.frame(p,misrate,sderror) return(result) }

plot_error <- function(x, y, sd, len = 1, col = "black") { len <- len * 0.05 arrows(x0 = x, y0 = y, x1 = x, y1 = y - sd, col = col, angle = 90, length = len) arrows(x0 = x, y0 = y, x1 = x, y1 = y + sd, col = col, angle = 90, length = len)}

# ================================   draw ==============================# seed =   9,10,   92,65，114,  10912seed_num =  9trainset <- as.data.frame(matrix(runif(80*21,0,1),80))trainset[,21] <- ifelse(trainset[,1] +trainset[,2] +trainset[,3] +trainset[,4] +trainset[,5]+                         trainset[,6] +trainset[,7] +trainset[,8] +trainset[,9] +trainset[,10]>5,1,0)

resultcv <- fun_cv(trainset,kfold = 10,seed_num,p = 1)for (p in 2:20){ resultcv <- rbind(resultcv,fun_cv(trainset,kfold = 10,seed_num,p))}

png(file = "Cross Validation_large_testset.png")

plot(result$k,result$PredictError,type = 'o',col = 'red',    xlim = c(0,20),ylim = c(0,0.6),xlab = '', ylab ='', lwd = 2)par(new = T)

plot(resultcv$p,resultcv$misrate,type='o',lwd=2,col='blue',ylim = c(0,0.6),xlim = c(0,20),    xlab = 'Subset Size p', ylab = 'Misclassification Error')plot_error(resultcv$p,resultcv$misrate,resultcv$sderror,col = 'blue',len = 1)dev.off()

参考文献

Ruppert D. The Elements of Statistical Learning: Data Mining, Inference, and Prediction[J]. Journal of the Royal Statistical Society, 2010, 99(466):567-567.

公众号后台回复关键字即可学习

回复爬虫         爬虫三大案例实战
回复 Python 1小时破冰入门

回复数据挖掘   R语言入门及数据挖掘
回复人工智能   三个月入门人工智能
回复数据分析师  数据分析师成长之路
回复机器学习      机器学习的商业应用
回复数据科学      数据科学实战
回复常用算法      常用数据挖掘算法

R语言模拟：Cross Validation相关推荐

R语言模拟疫情传播-gganimate包
本文用gganimate包展示模拟疫情数据本文篇幅较长,分为以下几个部分: 前言效果展示小结附录:代码前言前文<R语言模拟疫情传播-RVirusBroadcast>已经介绍了一 ...
R语言-模拟产生统计专业学生的成绩
现在Mayuyu会以一个例子来说明R语言在统计学中的应用.模拟一个高中学生语数外三科的成绩单. 首先认识两个重要的函数,source()和print(),source()函数是用来运行R脚本的,一个R ...
R语言中文社区2018年终文章整理（作者篇）
欢迎关注天善智能,我们是专注于商业智能BI,人工智能AI,大数据分析与挖掘领域的垂直社区,学习,问答.求职一站式搞定! 对商业智能BI.大数据分析挖掘.机器学习,python,R等数据领域感兴趣的同学 ...
r语言数据变量分段_R数据分析：用R语言做meta分析
这里以我的一篇meta分析为例,详细描述meta分析的一般步骤,该例子实现的是效应量β的合并 R包:metafor或meta包,第一个例子以metafor包为例. 1.准备数据集 2.异质性检验 in ...
语言模拟蒲丰问题_R语言小数定律的保险业应用：泊松分布模拟索赔次数
原文链接: 拓端数据科技 / Welcome to tecdattecdat.cn 在保险业中,由于分散投资,通常会在合法的大型投资组合中提及大数定律.在一定时期内,损失"可预测" ...
多元线性回归公式推导及R语言实现
多元线性回归多元线性回归模型实际中有很多问题是一个因变量与多个自变量成线性相关,我们可以用一个多元线性回归方程来表示. 为了方便计算,我们将上式写成矩阵形式: Y = XW 假设自变量维度为N W ...
多元线性回归分析c语言,多元线性回归公式推导及R语言实现
多元线性回归多元线性回归模型实际中有很多问题是一个因变量与多个自变量成线性相关,我们可以用一个多元线性回归方程来表示. 为了方便计算,我们将上式写成矩阵形式: Y = XW 假设自变量维度为N W ...
R语言数据统计分析的基本函数
转载于http://blog.csdn.net/yujunbeta/article/details/8547171 在面对大规模数据时,对数据预处理,获取基本信息是十分必要的.今天分享的就是数据预处理 ...
R语言与数据的预处理
在面对大规模数据时,对数据预处理,获取基本信息是十分必要的.今天分享的就是数据预处理的一些东西. 一.获取重要数据在导入大规模数据时,我们通常需要知道数据中的关键内容:最值,均值,离差,分位数,原点 ...
R语言实用案例分析-1
在日常生活和实际应用当中,我们经常会用到统计方面的知识,比如求最大值,求平均值等等.R语言是一门统计学语言,他可以方便的完成统计相关的计算,下面我们就来看一个相关案例. 1. 背景最近西安交大大数据 ...

R语言模拟：Cross Validation

R语言模拟：Cross Validation相关推荐

最新文章

热门文章