Bayesian kernel machine regression (BKMR) is a popular tool in the statistics and machine learning literature that flexibly models the relationship between a large number of predictor variables and a dependent outcome variable.
In this module we cover:
- Advantages of BKMR: the flexibility of the kernel function and hierarchical variable selection.
- Demonstrating BKMR's ability to model highly correlated predictors, using the URX variables in the NHANES chemical dataset with a Gaussian kernel.
- Bayesian model diagnostics through effective sample size (ESS) and autocorrelation.
We consider the general framework
\[g(\mu_i) = h(z_{i1}, \ldots, z_{iM}) + \beta x_i, \quad i = 1, \ldots, n,\]
where \(g\) is a link function, \(\mu_i = E(Y_i)\), \(h\) is a flexible function of the \(M\) exposure variables \(z_{i1}, \ldots, z_{iM}\), and \(x_i\) is a covariate with coefficient \(\beta\).
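In BKMR, \(h\) is specified implicitly through a kernel function. With the Gaussian kernel used in this module, the kernel takes the form
\[K(z, z') = \exp\left\{-\sum_{m=1}^{M} r_m (z_m - z'_m)^2\right\},\]
where \(r_m \ge 0\) controls how much \(h\) varies with the \(m\)-th exposure. Variable selection corresponds to allowing \(r_m = 0\), which removes exposure \(m\) from \(h\) entirely.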
To illustrate how BKMR enables hierarchical modeling of highly correlated predictors, we will use the nhanes1518 dataset, whose variables are described in the file nhanes-codebook.txt. Load this data with the load function, specifying the data file:
load(file='nhanes1518.rda')
We first explore the URX predictors (i.e., those measuring phthalate concentrations), subset the dataset to include only the predictors in the weighted quantile sum module, and then filter out NA values:
library(tidyverse)
df <- nhanes1518 %>%
  select(BMXBMI, URXUCR, URXCNP, URXCOP, URXECP, URXHIBP,
         URXMBP, URXMC1, URXMEP, URXMHBP, URXMHH, RIDAGEYR) %>%
  na.omit()
nhanes_URX <- nhanes1518 %>%
  select(BMXBMI, URXUCR, URXCNP, URXCOP, URXECP, URXHIBP,
         URXMBP, URXMC1, URXMEP, URXMHBP, URXMHH) %>%
  na.omit()
library(corrplot)
corr_mat <- cor(nhanes_URX, method = "s")  # Spearman correlations
corrplot(corr_mat)  # circles
corrplot(corr_mat, method = 'color', order = 'alphabet')  # colored squares
corrplot(corr_mat,
         addCoef.col = 1,   # print the correlation coefficients in black
         number.cex = 0.5)  # font size of the printed coefficients
corrplot.mixed(corr_mat, lower = 'shade', upper = 'pie', order = 'hclust')
We start exploring the data by plotting a correlation matrix of the variables of interest. We can use the corrplot() function from the corrplot package and adjust the style to taste. We see that the predictors are highly correlated, especially URXMHH with URXECP, and URXMHBP with URXMBP.
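To make "highly correlated" concrete, we can rank the off-diagonal entries of the correlation matrix. A minimal base-R sketch (using a small synthetic matrix as a stand-in for corr_mat, since the exact values depend on the data; with the real data, reuse corr_mat directly):

```r
# Hypothetical stand-in for corr_mat computed above
m <- matrix(c(1.0, 0.9, 0.2,
              0.9, 1.0, 0.3,
              0.2, 0.3, 1.0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Keep each pair once (upper triangle), then sort by absolute correlation
pairs <- which(upper.tri(m), arr.ind = TRUE)
ranked <- data.frame(var1 = rownames(m)[pairs[, 1]],
                     var2 = colnames(m)[pairs[, 2]],
                     corr = m[pairs])
ranked <- ranked[order(-abs(ranked$corr)), ]
head(ranked)  # strongest pairs first
```

Applied to corr_mat, the top rows of this table identify the pairs (such as URXMHH/URXECP) that motivate grouping exposures for hierarchical variable selection.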
To fit the BKMR model, we use the kmbayes function, which implements a Markov chain Monte Carlo (MCMC) sampler.
#install.packages("bkmr")
library(bkmr)
The key arguments to kmbayes are:
- iter: the number of iterations of the MCMC sampler;
- y: the vector of outcomes;
- Z: a matrix of exposures (each column is an exposure variable);
- X: a matrix of covariates (each column is a covariate);
- verbose: whether interim output summarizing the progress of the model fitting should be printed;
- varsel: whether to conduct variable selection on the predictors in Z.
set.seed(111)
dat <- SimData(n = 50, M = 4)  # simulated dataset with 4 exposure variables
y <- dat$y  # outcome (plays the role of BMI)
Z <- dat$Z  # exposures (the chemical mixture)
X <- dat$X  # covariate (plays the role of age)
fitkm <- kmbayes(y = y, Z = Z, X = X, iter = 10000, verbose = FALSE, varsel = TRUE)
set.seed(111)
sampled_data <- df %>% sample_frac(0.01)          # subsample for faster MCMC
y <- sampled_data$BMXBMI                          # outcome: BMI
Z <- sampled_data %>% select(-BMXBMI, -RIDAGEYR)  # chemical mixture exposures
X <- sampled_data %>% select(RIDAGEYR)            # covariate: age
fitkm_corr <- bkmr::kmbayes(y = y, Z = Z, X = X, iter = 2000, varsel = TRUE, verbose = FALSE)
# For hierarchical variable selection, supply a `groups` vector with one group
# label per column of Z, e.g. grouping exposures that are highly correlated:
# fitkm_hier <- kmbayes(y = y, Z = Z, X = X, iter = 2000, varsel = TRUE,
#                       groups = ..., verbose = FALSE)
sel <- seq(1, 2000, by = 1)  # iterations to display; start later to discard burn-in
bkmr::TracePlot(fit = fitkm_corr, par = "beta", sel = sel)
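The trace plot is a visual check; the objectives above also call for effective sample size (ESS) and autocorrelation diagnostics. The standard estimate is ESS = n / (1 + 2 Σₖ ρₖ), where ρₖ are the chain's autocorrelations; coda::effectiveSize computes this for a chain such as a column of the posterior draws stored in fitkm_corr$beta. Below is a self-contained base-R sketch on a simulated AR(1) chain (the chain object is a hypothetical stand-in for one parameter's MCMC draws):

```r
set.seed(1)
# Simulated AR(1) chain standing in for one column of posterior draws,
# e.g. fitkm_corr$beta[, 1]
n <- 5000
phi <- 0.8
chain <- numeric(n)
for (t in 2:n) chain[t] <- phi * chain[t - 1] + rnorm(1)

# ESS = n / (1 + 2 * sum of autocorrelations); truncate the sum
# at the first non-positive sample autocorrelation
rho <- acf(chain, lag.max = 200, plot = FALSE)$acf[-1]
cutoff <- which(rho <= 0)[1]
if (!is.na(cutoff)) rho <- rho[seq_len(cutoff - 1)]
ess <- n / (1 + 2 * sum(rho))
ess  # far below n for a strongly autocorrelated chain
```

A small ESS relative to iter signals that the sampler is mixing slowly and that more iterations (or thinning) may be needed before trusting posterior summaries.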
pred.resp.univar <- bkmr::PredictorResponseUnivar(fit = fitkm_corr)
library(ggplot2)
ggplot(pred.resp.univar,
       aes(z, est, ymin = est - 1.96 * se, ymax = est + 1.96 * se)) +
  geom_smooth(stat = "identity") +
  facet_wrap(~ variable) +
  ylab("h(z)")
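Because the model was fit with varsel = TRUE, posterior inclusion probabilities (PIPs) summarize how often each exposure enters h, and the overall mixture effect can be summarized across joint exposure quantiles. A sketch using the bkmr accessors, assuming the fitkm_corr, y, Z, and X objects from above:

```r
# Posterior inclusion probability for each column of Z;
# values near 1 suggest the exposure is important in h
pips <- bkmr::ExtractPIPs(fitkm_corr)
pips

# Overall mixture effect: h with all exposures at quantile q,
# compared with all exposures at the median (q.fixed = 0.5)
risks <- bkmr::OverallRiskSummaries(fit = fitkm_corr, y = y, Z = Z, X = X,
                                    qs = seq(0.25, 0.75, by = 0.05),
                                    q.fixed = 0.5)
ggplot(risks, aes(quantile, est,
                  ymin = est - 1.96 * sd, ymax = est + 1.96 * sd)) +
  geom_pointrange()
```

Together with the univariate predictor-response curves above, these summaries show both which exposures drive h and how the mixture as a whole relates to BMI.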