DSA为变量选择的方法,通过迭代方法实现多暴露数据的筛选,主要步骤为:Deletion/Substitution/Addtion,具体为1)移除所选择的变量;2)用未选择的变量替换已选择的变量;3)选择新的变量。通过五折交叉验证计算预测方程的均方根误差(L2损失函数)为最小时,确定纳入模型的变量个数及具体类型,实现变量的筛选。为了确保选择的稳定性,一般切换不同种子数运行DSA50次。 然后进行多暴露变量的二项式GLM评估,所纳入的变量至少在DSA模型的6%(n≥3次)或10%(n≥5次)中被选择,最后的模型中需验证多重共线性。

与传统的线性回归方程相比,该方程降低了假阳性率,各变量间实现互相调整并可探讨化学物间交互作用,但对检出率较低的化学物的交互作用探讨时作用有限。

R实现

1.软件安装

devtools::install_github("romainkp/modelUtils")
devtools::install_github("romainkp/DSA")

2.方程参数

Usage
DSA(formula, data, id = 1:nrow(data), family = gaussian, weights = NULL,
candidate.rank = NULL, rank.cutoffs =  NULL, maxsize, maxorderint,
maxsumofpow, Dmove=TRUE, Smove=TRUE, usersplits = NULL, userseed = NULL,
vfold = 5, nsplits = 1, model.matrices = FALSE, silent = TRUE, ...)Arguments
formula
a symbolic description of the base model which specifies the independent/response variable(s) and all terms forced in the final model. Typically, formula is set to Y ~ 1 when no terms are forced in the final model. Currently supported outcomes are continuous, binomial (binary with 0s and 1s or a two-column matrix of successes in the first column and failures in the second) and multinomial (categorical specified as factors). Dependent/Explanatory candidate variables can be continuous or categorical (factors). Only polynomial main terms and polynomial interaction terms specified within the I() subroutine can be forced in the final model, i.e. interactions between a factor and other variables are currently not suported.data
a non-optional data frame containing both the response variable(s) as well as the candidate covariates to be considered in the data-adaptive estimation procedure.id
a vector identifying each independent experimental unit in the data with a unique value, i.e. repeated measurements are allowed in data. The length of id must correspond to the number of observations (nrow(data)). The default value for id is 1:nrow(data) which indicates that all observations are independent.family
currently either binomial, multinomial, or gaussian. Used to determine whether logistic, multinomial logistic (softmax function), or general linear models should be considered.weights
a matrix of real numbers whose number of rows is the number of data splits plus one (i.e., vfold+1 or nrow(usersplits)+1) and whose number of columns is the number of observations. The first vfold or nrow(usersplits) rows contain the weights to be applied to each observation (whether in the training or validation set) at each split of the data. The last row contains the weights to be applied to each observation in the learning set. The argument weights is ignored if the value of weights is NULL (default).candidate.rank
a vector of named real numbers which ranks all candidate variables in data. Missing values are not allowed. Each element's name must correspond to the name of one of the candidate variables in data. The candidate variables' ranks are used to determine which set(s) of covariates in data will be considered in the DSA call based on the argument rank.cutoffs. When candidate.rank is NULL (default), the variables in data are ranked with a call to candidateReduction. Note that the candidate dummy variables automatically created for all factors in data receive different ranks by default while their ranks are identical when candidate.rank is user-supplied.rank.cutoffs
a vector of real numbers which is used with candidate.rank to define the set(s) of candidate variables in data that will be considered in the data-adaptive estimation procedure. By default, rank.cutoffs is set to NULL, which means that all candidate variables supplied in data are considered in the DSA call.maxsize
an integer strictly positive which limits the data-adaptive estimation procedure to candidate linear models with a maximum number of terms (excluding the intercept) lower or equal to maxsize. The argument maxsize must be larger or equal to the number of terms forced into the model through formula.maxorderint
an integer strictly positive which limits the data-adaptive estimation procedure to candidate linear models with a maximum order of interactions lower or equal to maxorderint.maxsumofpow
an integer larger or equal to maxorderint which limits the data-adaptive estimation procedure to candidate generalized linear models with terms involving variables whose powers sum up to a value lower or equal to maxsumofpow.Dmove
a logical parameter which specifies whether deletion moves are used to search through the space of candidate estimators. By default, deletion moves are allowed.Smove
a logical parameter which specifies whether substitution moves are used to search through the space of candidate estimators. By default, substitution moves are allowed.usersplits
an optional matrix of 0s or 1s whose number of columns is the number of observations. Each row indicates how the learning set is split in a training (0s) and validation (1s) set. The default value for usersplits is NULL which indicates that the data should be randomely split based on the v-fold splitting scheme corresponding to the values for userseed, vfold, nsplits, and id.userseed
a single value interpreted as an integer and used to set the seed of the random number generator which determines the v-fold data splits. If userseed=NULL (default) then the current seed from the current R session is used. The argument userseed is ignored if usersplits is non-null.vfold
an integer strictly larger than 1 specifying the number of splits (i.e, v) in the default v-fold splitting scheme for the cross-validation procedure. The default value for vfold is 5. The argument vfold is ignored if the argument usersplits is non-null.nsplits
an integer larger than 1 indicating the number of v-fold splits for the cross-validation procedure. The default value for nsplits is 1. This argument is ignored if the user provides the data splitting scheme with usersplits.model.matrices
a logical variable indicating that only the design and response matrices should be returned.silent
if FALSE then intermediate messages will be printed to standard output showing the progress of the computations (message level set to 0). if TRUE then no intermediate messages will be printed to standard output (message level set to -1). One can further control the messaging level to be printed to standard output with a call to the setDSAMessageLevel subroutine preceding a call to the DSA routine where the argument silent is not referenced....
currently used internally to recursively call the DSA.

3.示例

library(DSA)## an example using the state census data. (gaussian)
data(state)
state.data <- as.data.frame(state.x77)
colnames(state.data) <- unlist(lapply(strsplit(colnames(state.data), " "),function(x) paste(x, collapse = "")))
res <- DSA(Murder ~ 1, data = state.data, maxsize = 5, maxsumofpow = 2, maxorderint = 2)
residuals(res)
coefficients(res)
summary(res)res <- DSA(Murder ~ 1, data = state.data, maxsize = 5, maxsumofpow = 2, maxorderint = 2,nsplits=10,silent=FALSE)
residuals(res)
coefficients(res)
summary(res)## an example using weights (gaussian).
ddir <- paste(system.file(package = "DSA"), "/", "data", "/", sep = "")
x <- read.table(paste(ddir, "smalllymphRoX.txt", sep = ""), nrow=240)
y <- read.table(paste(ddir, "logT1lymph.txt", sep = ""))
weights <- as.matrix(read.table(paste(ddir, "lymphWeights_111505", sep = "")))
colnames(y) <- "loglymph"
data <- cbind(x, y)res <- DSA(loglymph ~ 1, data = data, maxsize = 5, maxorderint = 2, maxsumofpow = 2,weights = weights)
#pdf(file = "plot.pdf")
plot(res)
#dev.off()
summary(res)## an example using simulated data (binomial, logistic link).
n <- 1000
W <- cbind(rnorm(n), rnorm(n) < 1, rnorm(n) < 2, rnorm(n, 2, 4), runif(n),rcauchy(n), rlogis(n) < .1, rnorm(n) < .1, rnorm(n, 120, 10),rnorm(n, 66, 2))Y <- 22 + .5*W[,1] + .02*W[,1]^2 + .01*W[,1]*W[,2] + 2*W[,3] + .7*W[,4]^2
Y <- as.matrix(as.integer(Y - mean(Y))/sd(Y))
Y <- as.matrix(as.integer(Y < rlogis(n)))
colnames(W) <- paste("V", 1:ncol(W), sep = "")
colnames(Y) <- "Y"
data <- as.data.frame(cbind(W, Y))res <- DSA(Y ~ 1, data = data, family = binomial, maxsize = 8, maxorderint = 2, maxsumofpow = 3)
plot(res)
summary(res)## an example with factors.
n <- 1000
inc <- round(runif(n,1,4))
edu <- round(runif(n,1,4))W <- data.frame(income =as.factor(c("10000", "10000.35000", "35000.100000", "100000")[inc]),age = round(runif(n, 20, 60)),weight = rnorm(n, 140, 30),degree =as.factor(c("highschool", "highschool.grad", "college", "post.graduate")[edu]),male = rbinom(10, size = 1, prob = .5))Vertical.Leap <- 8 + ifelse(inc == 1, 2, 0) + (60 - W$age) * .1 + W$male * 4 + rnorm(n, 0, 4)data <- cbind(W, Vertical.Leap)
res <- DSA(Vertical.Leap ~ 1, data = data, maxsize = 8, maxorderint = 2, maxsumofpow = 2)
plot(res, plot.compare = TRUE)
summary(res)## now add some forced terms.
res <- DSA(Vertical.Leap ~ degree, data = data, maxsize = 8, maxorderint = 2, maxsumofpow = 2)
plot(res, plot.compare = TRUE)
summary(res)##
## an example using binomial - two column outcome and forced terms
##
n <- 1000
W <- cbind(rnorm(n), rnorm(n) < 1, rnorm(n) < 2, rnorm(n, 2, 4), runif(n),rcauchy(n), rlogis(n) < .1, rnorm(n) < .1, rnorm(n, 120, 10),rnorm(n, 66, 2))Y <- 10 + .5*W[,1] + .02*W[,1]^2 + .01*W[,1]*W[,2] + 2*W[,3] + .7*W[,4]^2
Y <- as.matrix(as.integer(Y - mean(Y))/sd(Y))
trials <- rpois(n, lambda = 20)
successes <- sapply(1:n, function(i) {rbinom(1, size = trials[i], prob = pnorm(Y[i]))
})
failures <- trials - successes
colnames(W) <- paste("V", 1:ncol(W), sep = "")
data <- as.data.frame(cbind(W, "successes" = successes, "failures" = failures))res <- DSA(cbind(successes, failures) ~ 1, data = data, family = binomial, maxsize = 8,maxorderint = 2, maxsumofpow = 3)
summary(res)
plot(res)## now add some forced terms
res <- DSA(cbind(successes, failures) ~ V1 + I(V4^2), data = data, family = binomial, maxsize = 8,maxorderint = 2, maxsumofpow = 3)
summary(res)
plot(res)## with user-specified dimension reduction
res <- DSA(cbind(successes, failures) ~ V1, data=data,family=binomial,  rank.cutoffs=0.1, maxsize = 5, maxorderint = 1, maxsumofpow = 2,userseed=29,nsplits=2,silent=FALSE)
summary(res)
res$candidate.vars
res$selected.vars## with dimension reduction based on cross-validation
res <- DSA(cbind(successes, failures) ~ V1, data=data,family=binomial,  rank.cutoffs=c(0.1,0.2,0.6), maxsize = 5, maxorderint = 1, maxsumofpow = 2,userseed=29)
summary(res)
res$candidate.vars
res$selected.vars## now a multinomial model.
NN <- 1000
W <- cbind(rep(1, NN), runif(NN, 0, 5), rnorm(NN, 5, 2))PY <- apply(W, 1, function(x) {coefs <- matrix(c(0,0,0,-9, 2, .8,-6, 1, .7,-4, .5, .6), nrow = 4, ncol = 3, byrow = TRUE)denom <- sum(apply(coefs, 1, function(cc) {exp(x %*% cc)}))as.numeric(apply(coefs, 1, function(cc) {exp(x %*% cc)/denom}))
})## add some columns to W which don't matter
W <- cbind(W, runif(NN, 0, 5), rnorm(NN, 5, 2))## turn them into counts.
Y <- apply(PY, 2, function(x) {ru <- runif(1)if (ru <= x[1])0else if (ru <= x[2] + x[3])1else if (ru <= x[3] + x[4])2else3
})dta.2 <- data.frame(W[,-1], "Y" = factor(Y))
levels(dta.2[,"Y"]) <- c("B","A","D","C")
res <- DSA(Y ~ 1, family=multinomial, data = dta.2, maxsize = 5,maxorderint = 2, maxsumofpow = 2,vfold = 5, userseed = 87,model.matrices = FALSE, nsplits=1)
res
summary(res)## Another example with the state census data where deletion and/or substitution moves are inhibited
res1 <- DSA(Murder ~ 1, data = state.data, maxsize = 10, maxsumofpow = 2, maxorderint = 2)
summary(res1)res2 <- DSA(Murder ~ 1, data = state.data, maxsize = 10, maxsumofpow = 2, maxorderint = 2,Dmove=FALSE)
summary(res2)res3 <- DSA(Murder ~ 1, data = state.data, maxsize = 10, maxsumofpow = 2, maxorderint = 2,Smove=FALSE)
summary(res3)res4 <- DSA(Murder ~ 1, data = state.data, maxsize = 10, maxsumofpow = 2, maxorderint = 2,Dmove=FALSE,Smove=FALSE)
summary(res4)

Ref:

Granum, Berit, et al. "Multiple environmental exposures in early-life and allergy-related outcomes in childhood." Environment International 144 (2020): 106038.

https://www.rdocumentation.org/packages/partDSA/versions/0.9.14/topics/partDSA

Deletion/Substitution/Addtion(DSA)的R实现相关推荐

  1. python 编辑距离_最小编辑距离(Levenshtein)的 Python 实现

    一直以为这个算法实现起来的代码量很高,直到最近刷 <Speech and Language Processing>,用动态规划做起来,简单.优雅. 算法原理 字符串 X,Y 的长度分别是 ...

  2. 出生队列研究中的暴露组学应用

    Applying the exposome concept in birth cohort research- a review 类别:统计学习方法 时间:20210630 1.背景 大量可改善的疾病 ...

  3. NLP-文本处理:拼写纠错【非词(编辑距离)、真词(编辑距离...)候选词 -> “噪音通道模型”计算候选词错拼成待纠错词的似然概率 -> N-gram模型评估候选词组成的语句合理性】

    一.贝叶斯公式 1.单事件 P(Ax∣B)P(A_x|B)P(Ax​∣B)=P(AxB)P(B)=P(B∣Ax)×P(Ax)P(B)=P(B∣Ax)×P(Ax)∑i=0n[P(B∣Ai)∗P(Ai)] ...

  4. Linux Shell Tips小技巧

    文章目录 sed 指定行 删除文本 替换文本 小技巧 查找N天内修改文件 Shell写R语言 makefile写shell bad interpreter错误 替换换行符为空格 压缩并打包目录 重定向 ...

  5. 今日arXiv精选 | 11篇EMNLP 2021最新论文

     关于 #今日arXiv精选  这是「AI 学术前沿」旗下的一档栏目,编辑将每日从arXiv中精选高质量论文,推送给读者. Does Vision-and-Language Pretraining I ...

  6. 编辑距离WER/CER计算的一种python实现

    WER(word error rate)经常作为语音识别任务的性能评测指标,WER的计算公式,直接从网上粘贴过来了. 一些语音识别框架(如:Kaldi.ESPNet等)中,都会包含wer的计算方法,其 ...

  7. python--计算两个中文字符串的编辑距离

    #计算两个中文字符串的编辑距离 def levenshtein(string1,string2):if len(string1) > len(string2):string1,string2 = ...

  8. python 编辑距离_编辑距离(Levenshtein距离)详解(附python实现)

    编辑距离定义: 编辑距离,又称Levenshtein距离,是指两个字串之间,由一个转成另一个所需的最少编辑操作次数. 许可的编辑操作包括:将一个字符替换成另一个字符,插入一个字符,删除一个字符. 例如 ...

  9. 秀一下以前搜房soufun发贴机的发帖群发日志!!呵呵..

    秀一下以前搜房soufun论坛发贴机的发帖群发日志!!呵呵.. QQ 1163551688 "2009-11-28 18:00:06 562","qwerty00789& ...

最新文章

  1. iOS 9应用开发教程之ios9中实现按钮的响应
  2. 小马拉大车,无线网络优化
  3. vb鼠标涂鸦板的制作
  4. uni map 实时记录轨迹_国际学校纷纷引进MAP考试系统,到底有什么好处?
  5. 【异常(待解决)】org.apache.http.NoHttpResponseException: api.weixin.qq.com:443 failed to respond...
  6. ip字符串转换 linux,Linux网络编程入门
  7. 1159 最大全0子矩阵
  8. C++ using关键字作用总结
  9. draw9patch做一个中心不变形的图片
  10. 双目摄像头的帧同步输入fsync信号_读源码长知识 | Android卡顿真的是因为”掉帧“?...
  11. Java学习笔记之JDBC和连接池
  12. Flask 项目打包 线上部署
  13. 2021年最新版裁判文书逆向
  14. 基础知识(三),OSI七层协议、数据传输过程、数据的封装与解封装、IP抓包分析、交换机、路由器、ARP协议、TRUNK中继、VLAN、DHCP中继、ICMP协议、三层交换机
  15. element tree 父级勾选子级也勾选,子级勾选默认父级也勾选, 子级取消勾选不影响父级勾选(前端)
  16. C/C++ 延时函数 (标准库)
  17. Bert模型详解和训练实例
  18. hp6960无法连接计算机,支持多种打印方式 惠普OfficeJet Pro 6960评测
  19. 1个人每天5小时投资3万,年收益30万,你还不心动?
  20. 英语——形容词(一)

热门文章

  1. 学习Linux命令(16)
  2. 2022-1-16牛客C++项目——Linux多进程编程——进程间通信
  3. 2022-1-8数据库期末复习提纲(三)
  4. Sensor Demoasic (CFA)IP仿真实例
  5. FEN Self-Supervised Feature Enhancement Networks for Small Object Detection in Noisy Images
  6. java中怎么给方法加锁_Java中,我会用ArrayList,怎么还要会用CopyOnWriteArrayList
  7. arduino笔记27:mh-sensor-series + 土壤传感器
  8. 电商订单表的设计mysql_电商平台数据库中订单表的设计为什么要有订单号?
  9. 泛世纪上大量最新的录制课程啊。欢迎来下载
  10. php进销存 手机版_银鱼进销存app下载-银鱼进销存手机版 v1.4.6