Introduction

As is often the case with health data, chemical mixtures may have non-linear relationships with the outcome variable, so standard additive models may not be well suited. Highly correlated exposures and complex interaction effects can also lead to unstable estimates under previous models. In this tutorial, we explore Bayesian Kernel Machine Regression (BKMR), which allows increased flexibility for modeling complex relationships between the predictors and the outcome.

Kernel Machine Regression provides a more flexible approach by using a kernel. Kernels are useful in regression because they estimate the relationship between the predictors and the outcome without assuming its form (linear, logistic, exponential, etc.); in this sense, kernel regression is a non-parametric technique. To estimate the fitted value at a given point, the kernel gives higher weight to data points whose predictor values are close to that point and lower weight to points that are farther away. This allows the regression model to better fit the data and account for complex relationships between variables.

Additionally, Bayesian Kernel Machine Regression embeds Kernel Machine Regression in a Bayesian framework: it places priors on the model parameters, quantifies uncertainty through the posterior, and allows extensions such as hierarchical variable selection.

Preparation

library(tidyverse)
library(tidymodels)
library(ggcorrplot)
library(corrplot)

To illustrate how BKMR handles highly correlated predictors, we will use the nhanes dataset, whose variables are described in the file nhanes-codebook.txt. Load this data with the load function, specifying the data file.

load(file='nhanes1518.rda')

We first explore the URX predictors (i.e. the ones related to phthalate concentrations), subsetting the dataset to the same predictors used in the weighted quantile sum module, and then filter out NA values:

nhanes <- nhanes1518 |>
  select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXECP, URXHIBP, URXMIB, RIDAGEYR) |>
  na.omit() |>
  mutate(BMXBMI = log(BMXBMI), RIDAGEYR = RIDAGEYR - 18)

We start exploring the data by plotting a correlation matrix between the variables of interest. We can use the corrplot() function from the corrplot package and adjust the style to our desired aesthetics. We see that the predictors are highly correlated, especially URXMHH with URXECP and URXMHBP with URXMIB:

corr_mat <- cor(nhanes |> select(-RIDAGEYR), method = "s")
corrplot(corr_mat,
         addCoef.col = 1,  # color of the correlation coefficients
         number.cex = 0.5) # font size of the correlation coefficients

We also plot the distributions of the variables involved in the model, and notice that they have long tails on the original scale, which suggests the log transformation we will perform later on.

df_long <- nhanes |> select(-RIDAGEYR) |> pivot_longer(everything())

# Plot histograms using ggplot
ggplot(df_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ name, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Bayesian Kernel Machine Regression

Bayesian Kernel Machine Regression (BKMR) models the relationship between the outcome and the predictors using two parts: a nonparametric function \(h(\cdot)\) and a linear function.

\[Y_i = h(\bf{Z}_i)+X_i\beta + \epsilon_i,\]

where \(\bf{Z}_i = (Z_{1i},\dots,Z_{Ki})\) are the chemical mixture exposures of observation \(i\); \(h(\cdot)\), called the predictor-response function, is a nonparametric function capturing the nonlinear relationship between the chemical mixtures and the outcome; \(X_i\) represents other covariates that may have a linear relationship with the outcome, with corresponding coefficient \(\beta\); and \(\epsilon_i \overset{i.i.d.}{\sim} N(0,\sigma^2)\) is the error term.

The nonparametric function \(h(\cdot)\) is highly flexible, and is usually represented using a kernel function:

\[h(\bf{Z}) = \sum_{i=1}^n K(Z_i,Z) \alpha_i,\]

where \(K(Z,Z')\) is a kernel function that measures the similarity of \(Z\) and \(Z'\), and the \(\alpha_i\)’s are coefficients. The fitted value at \(Z\) is a weighted combination over the data points \(\{Z_i, i = 1,\dots,n\}\): data points that are more similar (closer) to \(Z\) play a more important role in the estimate. A common choice is a Gaussian kernel

\[K(Z,Z') = \exp\Big\{-\sum_{k=1}^K(Z_k-Z_k')^2/\rho\Big\}.\]

BKMR also allows for variable selection by slightly modifying the kernel:

\[K(Z,Z';r) = \exp\Big\{-\sum_{k=1}^K r_k(Z_k-Z_k')^2\Big\},\]

where the \(r_k\) are non-negative values with a “spike-and-slab” prior. If \(r_k=0\), the variable \(Z_k\) is excluded from the model. This variable selection method is similar to the frequentist LASSO shrinkage.
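To make the kernel concrete, the component-weighted Gaussian kernel above can be written as a short R function. This is a minimal sketch for illustration only (the helper name gauss_kernel is ours; bkmr computes the kernel internally). Setting all \(r_k = 1/\rho\) recovers the unweighted Gaussian kernel.

```r
# Component-weighted Gaussian kernel: K(z, z'; r) = exp(-sum_k r_k * (z_k - z'_k)^2)
gauss_kernel <- function(z1, z2, r) {
  exp(-sum(r * (z1 - z2)^2))
}

z1 <- c(0, 0); z2 <- c(1, 2)
gauss_kernel(z1, z2, r = c(1, 1)) # similarity decreases with squared distance
gauss_kernel(z1, z2, r = c(1, 0)) # r_2 = 0: the second component is ignored
```

Notice how \(r_k = 0\) makes the kernel blind to the corresponding exposure, which is exactly how the spike-and-slab prior performs variable selection.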

Model Fitting

As in previous tutorials, our research question is: predicting log BMI using phthalate measurements from urine samples for individuals above 18 years of age. We assume age has a linear relationship with log BMI. For illustration, we randomly sample 10% of the dataset.

set.seed(123)
nhanes <- nhanes |> sample_frac(size = 0.1)
y <- nhanes$BMXBMI                        # log BMI
Z <- nhanes |> select(-BMXBMI, -RIDAGEYR) # chemical mixtures
Z <- log(Z)
X <- nhanes |> select(RIDAGEYR)           # covariate with a linear effect (age - 18)
Z <- as.matrix(Z) |> scale()
X <- as.matrix(X)

To fit the BKMR model, we use the kmbayes function in the R package bkmr.

#install.packages("bkmr")
library(bkmr)
## For guided examples, go to 'https://jenfb.github.io/bkmr/overview.html'

In the model:

  • y, a vector of the response

  • Z, the matrix of chemical mixtures (variables with a nonlinear relationship to the outcome)

  • X, the matrix of other covariates with a linear relationship to the outcome

  • iter, the number of iterations of the MCMC sampler

  • varsel, whether to conduct variable selection on the predictors \(Z\)

  • verbose, whether to print details during model fitting

In this example, we set varsel = TRUE to conduct variable selection while fitting the model.

set.seed(1)
fitkm <- kmbayes(y = y, Z = Z, X = X, iter = 5000, varsel = TRUE, verbose = FALSE)

Model Interpretation and Inference

Estimating Parameters

ExtractEsts() provides summary statistics (mean, standard deviation, quantiles) of the posterior samples of the parameters \(\beta\), \(r_k\), and \(\sigma^2\). We can use q_2.5 and q_97.5 (the 2.5% and 97.5% quantiles) to construct a 95% credible interval.

ExtractEsts(fitkm)
## $sigsq.eps
##                 mean          sd     q_2.5       q_25       q_50       q_75
## sigsq.eps 0.05759287 0.003491528 0.0511592 0.05518165 0.05746192 0.05984281
##               q_97.5
## sigsq.eps 0.06461115
## 
## $beta
##             mean           sd       q_2.5        q_25        q_50        q_75
## beta 0.006593113 0.0004329955 0.005764133 0.006298141 0.006597041 0.006885753
##           q_97.5
## beta 0.007436542
## 
## $lambda
##            mean       sd    q_2.5     q_25     q_50     q_75   q_97.5
## lambda 33.64641 12.98202 14.68944 24.16913 31.62367 40.48474 67.08331
## 
## $r
##            mean          sd      q_2.5       q_25       q_50       q_75
## r1 0.0007126409 0.002991421 0.00000000 0.00000000 0.00000000 0.00000000
## r2 0.0004297215 0.002402648 0.00000000 0.00000000 0.00000000 0.00000000
## r3 0.0000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000
## r4 0.0000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000
## r5 0.0000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000
## r6 0.0000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000
## r7 0.0004761129 0.003591938 0.00000000 0.00000000 0.00000000 0.00000000
## r8 0.0145516694 0.004500822 0.01011490 0.01106806 0.01285694 0.01742771
## r9 0.0137789790 0.004082672 0.01027884 0.01203941 0.01277530 0.01413881
##        q_97.5
## r1 0.01375564
## r2 0.01011203
## r3 0.00000000
## r4 0.00000000
## r5 0.00000000
## r6 0.00000000
## r7 0.01087760
## r8 0.02660138
## r9 0.01982509

For the selected sample, holding other variables constant, for each additional year of age, 95% of the posterior distribution for the percentage change in BMI lies between 0.578% and 0.746%.
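Because the outcome is log BMI, the percentage interpretation comes from exponentiating the credible-interval endpoints of \(\beta\). A quick check, plugging in the q_2.5 and q_97.5 values from the output above:

```r
# 95% credible interval for beta, copied from ExtractEsts(fitkm)$beta above
beta_ci <- c(0.005764133, 0.007436542)
# On the log scale, one extra year of age multiplies BMI by exp(beta),
# i.e. a percentage change of 100 * (exp(beta) - 1)
100 * (exp(beta_ci) - 1)  # approximately 0.578% to 0.746%
```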

Estimating posterior inclusion probability

If we conduct variable selection (varsel = TRUE) while fitting the model, we can use ExtractPIPs() to extract the posterior inclusion probabilities (PIPs). A PIP is the posterior probability that a specific predictor is included in the model, providing a quantitative measure of the predictor’s importance.

For the selected sample, URXHIBP and URXMIB have PIPs of 1, suggesting both have potential effects on BMI. URXMHBP, URXMOH, and URXECP have non-zero but small PIPs, suggesting these three chemicals may have some potential effect on BMI. The chemicals with PIPs equal to 0 may have no effect on BMI when the other chemicals are present.

ExtractPIPs(fitkm)
##   variable    PIP
## 1  URXMHBP 0.0544
## 2   URXMOH 0.0340
## 3   URXMHP 0.0000
## 4   URXMHH 0.0000
## 5  URXMCOH 0.0000
## 6  URXMHNC 0.0000
## 7   URXECP 0.0256
## 8  URXHIBP 1.0000
## 9   URXMIB 1.0000

Alternatively, we can use summary() to summarize BKMR model fits. Here we set show_MH = FALSE to avoid printing acceptance rates from the Metropolis-Hastings algorithm. The acceptance rate can be used to evaluate convergence and mixing of the MCMC sampler; we will discuss it in the MCMC Diagnostics section.

summary(fitkm,show_MH = FALSE)
## Fitted object of class 'bkmrfit'
## Iterations: 5000 
## Outcome family: gaussian  
## Model fit on: 2025-03-26 22:52:43.671329 
## Running time:  11.39405 mins 
## 
## 
## Parameter estimates (based on iterations 2501-5000):
##        param     mean       sd    q_2.5   q_97.5
## 1       beta  0.00659  0.00043  0.00576  0.00744
## 2  sigsq.eps  0.05759  0.00349  0.05116  0.06461
## 3         r1  0.00071  0.00299  0.00000  0.01376
## 4         r2  0.00043  0.00240  0.00000  0.01011
## 5         r3  0.00000  0.00000  0.00000  0.00000
## 6         r4  0.00000  0.00000  0.00000  0.00000
## 7         r5  0.00000  0.00000  0.00000  0.00000
## 8         r6  0.00000  0.00000  0.00000  0.00000
## 9         r7  0.00048  0.00359  0.00000  0.01088
## 10        r8  0.01455  0.00450  0.01011  0.02660
## 11        r9  0.01378  0.00408  0.01028  0.01983
## 12    lambda 33.64641 12.98202 14.68944 67.08331
## 
## Posterior inclusion probabilities:
##   variable    PIP
## 1  URXMHBP 0.0544
## 2   URXMOH 0.0340
## 3   URXMHP 0.0000
## 4   URXMHH 0.0000
## 5  URXMCOH 0.0000
## 6  URXMHNC 0.0000
## 7   URXECP 0.0256
## 8  URXHIBP 1.0000
## 9   URXMIB 1.0000
## NULL

Notice that we cannot directly interpret the parameters \(r\). Instead, we estimate \(h(\cdot)\) to interpret the relationship between the outcome and the chemical mixtures.

Estimating h

The function \(h(\bf{Z})=h((Z_1,\dots,Z_K))\), called the predictor-response function, captures how the predictors (chemical mixtures) influence the response, log(BMI). It accounts for nonlinear relationships as well as potential interactions.

Notice that \(h(\cdot)\) is a high-dimensional surface that cannot be visualized directly, and it takes a nonparametric form that is hard to interpret. Here we introduce three ways to explore \(h(\cdot)\):

  • Estimate h at a given value of predictors: \(h(Z=z)\)

  • Uni/Bi-variable predictor-response function: holding other predictors constant, exploring relationship of one or two predictors (chemical mixtures) with the outcome, which is \(h(Z_k = z_k|Z_{(-k)}=Z_{(-k),med})\) or \(h(Z_k = z_k, Z_m = z_m|Z_{(-k,-m)}=Z_{(-k,-m),med})\).

  • Overall risk summaries: calculating the overall effect of the predictors by comparing \(h(Z=z)\) with \(h(Z=Z_{med})\).

Estimate h at a given value of predictors

Notice the BKMR can be rewritten as follows:

\[y_i \sim N(h_i+x_i^T\beta,\sigma^2)\] \[\bf{h}=(h_1,\dots,h_n)^T \sim N(0,\tau K)\]

where \(K\) is an \(n \times n\) kernel matrix with \((i,j)\)-element \(K(z_i,z_j)=\exp\{-\sum_{k=1}^K r_k(z_{ik}-z_{jk})^2\}\).

Derived from this model, the posterior distribution of \(h(\cdot)\) is normally distributed with mean \(\mu_h(\theta)\) and variance \(V_h(\theta)\), where \(\theta=(\beta,r,\lambda)\) are parameters.

The BKMR package can estimate \(h(\cdot)\) using three methods:

(1) ComputePostmeanHnew(method = "approx"): \(h \sim N(\hat{\mu}_h,\hat{V}_h)\), where \(\hat{\mu}_h\) and \(\hat{V}_h\) are approximated by plugging in the posterior mean of \(\theta\): \(\mu_h(\hat{\theta})\), \(V_h(\hat{\theta})\).

(2) ComputePostmeanHnew(method = "exact"): \(h \sim N(\hat{\mu}_h,\hat{V}_h)\), where \(\hat{\mu}_h\) and \(\hat{V}_h\) are \(E(\mu_h(\theta))\) and \(E(V_h(\theta))+Var(\mu_h(\theta))\), estimated from the posterior samples.

(3) SamplePred(): \(h \sim N(\mu_h,V_h)\). Instead of estimating the posterior mean and variance, we directly sample \(h\) given posterior samples of \(\theta\).

The first approach is fast but only an approximation. The second and third approaches are exact methods that provide unbiased estimates of posterior summaries, and the third also provides the full posterior distribution. However, the two exact methods take much longer, especially the third. We recommend the third for small datasets, the second for moderately sized datasets, and the first for large datasets.

The three methods are demonstrated below using the median values of the phthalate predictors from the NHANES dataset.

# calculate the median of the Z's
med <- apply(Z, 2, median)
Znew <- matrix(med, nrow = 1)
# approach 1: 
h_est1 <- ComputePostmeanHnew(fitkm, Znew = Znew, method = "approx")
# approach 2:
h_est2 <- ComputePostmeanHnew(fitkm, Znew = Znew, method = "exact")
# approach 3:
set.seed(1)
samps3 <- SamplePred(fitkm, Znew = Znew, Xnew = cbind(0))

compare <- data.frame(
  method = c(1:3),
  post_mean = c(h_est1$postmean, h_est2$postmean, mean(samps3)),
  post_sd = c(sqrt(h_est1$postvar), sqrt(h_est2$postvar), sd(samps3))
)

print(compare)
##   method post_mean    post_sd
## 1      1  3.106074 0.01287843
## 2      2  3.102730 0.01472242
## 3      3  3.103042 0.01406543

The second and third methods produced similar posterior means. The first, approximate method differed slightly from the other two and underestimated the standard deviation.

Uni/Bi-variable predictor-response function

PredictorResponseUnivar() captures the relationship between \(h\) and a single predictor, fixing the other predictors at their medians. As demonstrated in the previous section, there are three different ways to estimate \(h\); the default is “approx”, and we can switch to another by setting the method = argument.

We observe that for predictors with small PIPs, the posterior means (blue lines) of \(h(\cdot)\) are horizontal lines at 0, indicating that they do not have a large impact on the response.

Here we focus on URXHIBP and URXMIB, since their PIPs equal 1. Holding the other variables at their median levels, people with higher values of URXHIBP are likely to have higher BMI, while people with higher values of URXMIB are likely to have lower BMI.

pred.resp.univar <- PredictorResponseUnivar(fit = fitkm)
ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se, ymax = est + 1.96*se)) + 
    geom_smooth(stat = "identity") + 
    facet_wrap(~ variable) +
  ylab("h(z)")

To investigate the relationship between \(h\) and two predictors, we can plot the relationship between the outcome and the first predictor with the second predictor at varying quantile levels, holding the other predictors at a fixed quantile (the median by default). We apply PredictorResponseBivar() to obtain the relationship between the outcome and two predictors, then PredictorResponseBivarLevels() to obtain the relationship between the outcome and the first predictor conditional on the second.

Visualizing the relationship between \(h\) and two predictors helps us check for interactions between them. Here we only select predictors with non-zero PIPs.

We observe clear differences among the lines of different colors only for URXHIBP and URXMIB, consistent with the PIP results, and the trends are similar to the univariate predictor-response visualization.

We can check for interactions between predictors from the plot: if the lines of different colors cross, an interaction may be present. Here most lines are parallel to each other, except for the lines between URXHIBP and URXMIB. \(h\) grows faster in URXMIB when URXHIBP is at a relatively small value (0.1 quantile, red line), compared with the median or high level of URXHIBP.

pred.resp.bivar <- PredictorResponseBivar(fit = fitkm, min.plot.dist = 1)
pred.resp.bivar.levels <- PredictorResponseBivarLevels(
  pred.resp.df = pred.resp.bivar, Z = Z, qs = c(0.1, 0.5, 0.9))
var_nonzero <- ExtractPIPs(fitkm)[ExtractPIPs(fitkm)[, 2] > 0, 1]
pred.resp.bivar.levels <- pred.resp.bivar.levels %>%
  filter(variable1 %in% var_nonzero, variable2 %in% var_nonzero)
ggplot(pred.resp.bivar.levels, aes(z1, est)) + 
    geom_smooth(aes(col = quantile), stat = "identity") + 
    facet_grid(variable2 ~ variable1) +
    ggtitle("h(expos1 | quantiles of expos2)") +
  xlab("expos1") +
  theme_bw()

Overall risk summaries

Summary statistics of the exposure-response function \(h(z)\) can also be examined and visualized. Instead of considering the effect of one predictor conditional on the others, we can quantify the total effect of the predictors by comparing \(h(\cdot)\) with all predictors at a given quantile level to \(h(\cdot)\) with all predictors at their medians.

For example, we calculate \(h(Z=Z_q)-h(Z=Z_{0.5})\): the overall contribution of the phthalates at their 0.2, 0.4, 0.6, and 0.8 quantiles (qs) minus the response at their 0.5 quantiles. Note that we can select the reference quantile via q.fixed, as well as the estimation method (default method = "approx").

We observe a decreasing trend, suggesting that people with higher total phthalate levels tend to have lower BMI, holding age constant. For people of the same age, with 95% posterior probability, those with high total phthalate levels (each phthalate at its 0.8 quantile) have a log BMI change between -0.037 and 0.017 compared with those at moderate levels (each phthalate at its median).

risks.overall <- OverallRiskSummaries(fit = fitkm, y = y, Z = Z, X = X, 
                                      qs = seq(0.2, 0.8, by = 0.2), 
                                      q.fixed = 0.5)
risks.overall %>%
  mutate(lowerCI = est - 1.96*sd, upperCI = est + 1.96*sd)
##   quantile          est          sd      lowerCI     upperCI
## 1      0.2  0.004784706 0.017205518 -0.028938109 0.038507522
## 2      0.4  0.001747921 0.004013293 -0.006118133 0.009613975
## 3      0.6  0.001436503 0.004407295 -0.007201796 0.010074802
## 4      0.8 -0.010267960 0.013665622 -0.037052579 0.016516658
ggplot(risks.overall, aes(quantile, est, ymin = est - 1.96*sd, ymax = est + 1.96*sd)) + 
    geom_pointrange()

SingVarRiskSummaries() focuses on \(h(z_{j,q_2}\mid Z_{(-j)}=Z_{(-j),0.5}) - h(z_{j,q_1}\mid Z_{(-j)}=Z_{(-j),0.5})\): the difference in \(h(\cdot)\) when a single predictor is at its \(q_2\)-quantile versus its \(q_1\)-quantile, conditional on the other predictors at a fixed quantile. The argument qs.diff indicates which quantiles are compared for the target predictor, and q.fixed indicates the quantile at which the other predictors are held.

risks.singvar <- SingVarRiskSummaries(fit = fitkm, y = y, Z = Z, X = X, 
                                      qs.diff = c(0.25, 0.75), 
                                      q.fixed = c(0.25, 0.50, 0.75),
                                      method = "exact")
risks.singvar
## # A tibble: 27 × 4
##    q.fixed variable      est      sd
##    <fct>   <fct>       <dbl>   <dbl>
##  1 0.25    URXMHBP  -0.00569 0.0248 
##  2 0.25    URXMOH   -0.00138 0.00824
##  3 0.25    URXMHP    0       0      
##  4 0.25    URXMHH    0       0      
##  5 0.25    URXMCOH   0       0      
##  6 0.25    URXMHNC   0       0      
##  7 0.25    URXECP   -0.00111 0.00757
##  8 0.25    URXHIBP  -0.237   0.0505 
##  9 0.25    URXMIB    0.167   0.0428 
## 10 0.5     URXMHBP  -0.00219 0.0126 
## # ℹ 17 more rows
ggplot(risks.singvar, aes(variable, est, ymin = est - 1.96*sd, 
                          ymax = est + 1.96*sd, col = q.fixed)) + 
    geom_pointrange(position = position_dodge(width = 0.75)) + 
  coord_flip()

SingVarIntSummaries() calculates \(\big[h(z_{j,q_2}\mid Z_{(-j)}=Z_{(-j),p_2}) - h(z_{j,q_1}\mid Z_{(-j)}=Z_{(-j),p_2})\big] - \big[h(z_{j,q_2}\mid Z_{(-j)}=Z_{(-j),p_1}) - h(z_{j,q_1}\mid Z_{(-j)}=Z_{(-j),p_1})\big]\): how the single-predictor contrast changes when the other predictors move from their \(p_1\)- to their \(p_2\)-quantile, a measure of interaction.

risks.int <- SingVarIntSummaries(fit = fitkm, y = y, Z = Z, X = X, 
                                 qs.diff = c(0.25, 0.75), 
                                 qs.fixed = c(0.25, 0.75),
                                 method = "exact")
risks.int
## # A tibble: 9 × 3
##   variable      est      sd
##   <fct>       <dbl>   <dbl>
## 1 URXMHBP  0.00654  0.0290 
## 2 URXMOH   0.00125  0.00825
## 3 URXMHP   0        0      
## 4 URXMHH   0        0      
## 5 URXMCOH  0        0      
## 6 URXMHNC  0        0      
## 7 URXECP   0.000871 0.00697
## 8 URXHIBP  0.0759   0.0531 
## 9 URXMIB   0.0745   0.0478

Model Prediction

We apply SamplePred() to obtain posterior predictive samples, and construct a 95% credible interval from the 0.025 and 0.975 quantiles of those samples. In the example below, for an individual of age 18 with typical (median) exposure values, log BMI will on average lie between 3.07 and 3.13.

med <- apply(Z, 2, median)
Znew <- matrix(med, nrow = 1)
y_pred <- SamplePred(fitkm, Znew = Znew, Xnew = 0)
quantile(y_pred,c(0.025,0.975))
##     2.5%    97.5% 
## 3.070586 3.133297

MCMC Diagnostics

Below, we can visualize the general trace of the Markov Chain Monte Carlo method for the samples. This graphs the \(\beta\) parameter used in the BKMR model.

TracePlot(fit = fitkm, par = "beta")

Examining the trace plot, the chain shows no noticeable patterns across iterations (no apparent wandering trends), and the samples span a consistent range of parameter values. Therefore, the model appears to have converged, with overall favorable mixing.
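Trace plots for the other parameters can be examined the same way, reusing the fitted object fitkm from above. As a hedged sketch (assuming TracePlot() also accepts these parameter names and a comp argument to pick a component, as in the bkmr overview):

```r
TracePlot(fit = fitkm, par = "sigsq.eps")   # residual variance
TracePlot(fit = fitkm, par = "r", comp = 8) # r for URXHIBP, which had PIP = 1
```

Checking several parameters, not just \(\beta\), gives a fuller picture of convergence and mixing.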

Miscellaneous

Hierarchical Variable Selection

One of the appeals of BKMR is hierarchical variable selection, which assesses which variables are critical to the model by grouping them. The chemical exposures are typically grouped by their relationships with other exposures or by prior knowledge. Each exposure in the Z matrix can be assigned to a group via, e.g., groups = c(1,1,2,2), where the first two variables in Z are assigned to group 1 and the remaining variables to group 2. If the variables are not grouped, component-wise selection evaluates variables on an individual basis (varsel = TRUE with no groups argument).

Here we use the previous example, but with three groups: (1) URXMHBP, URXMOH, URXECP; (2) URXMHP, URXMHH, URXMCOH, URXMHNC; (3) URXHIBP, URXMIB.

set.seed(1)
fitkm_hier <- kmbayes(y = y, Z = Z, X = X, iter = 5000, varsel = TRUE, 
                      groups = c(1,1,2,2,2,2,1,3,3), verbose = FALSE)

We can contrast the posterior inclusion probabilities from the two models:

Where the exposures were grouped:

ExtractPIPs(fitkm)
##   variable    PIP
## 1  URXMHBP 0.0544
## 2   URXMOH 0.0340
## 3   URXMHP 0.0000
## 4   URXMHH 0.0000
## 5  URXMCOH 0.0000
## 6  URXMHNC 0.0000
## 7   URXECP 0.0256
## 8  URXHIBP 1.0000
## 9   URXMIB 1.0000


For the grouped exposures, the first and third groups have group PIPs greater than 0.5, suggesting they should be included in the model. Within the first group, URXMHBP is the component most likely associated with BMI; within the third group, it is URXMIB.

ExtractPIPs(fitkm_hier)
##   variable group groupPIP    condPIP
## 1  URXMHBP     1   0.5396 0.68346924
## 2   URXMOH     1   0.5396 0.16530764
## 3   URXMHP     2   0.0160 0.40000000
## 4   URXMHH     2   0.0160 0.12500000
## 5  URXMCOH     2   0.0160 0.47500000
## 6  URXMHNC     2   0.0160 0.00000000
## 7   URXECP     1   0.5396 0.15122313
## 8  URXHIBP     3   0.6852 0.06830123
## 9   URXMIB     3   0.6852 0.93169877

Changing Tuning Parameters or the Prior Distribution

Another feature of the bkmr package is that the tuning parameters of the fitting algorithm can be manually adjusted. The model uses Markov chain Monte Carlo, with Gibbs steps for updating some parameters, and the tuning parameters are accessed through the control.params argument. For instance, lambda.jump sets the standard deviation of the proposal distribution for \(\lambda = \tau / \sigma^2\), where \(\tau\) is the variance component of the kernel matrix controlling the smoothness of the exposure-response function. The standard deviation of the proposal distribution for \(r\) is adjusted with r.jump. When variable selection is used, the proposal can depend on whether a variable is moving from excluded to included or remaining in the model: if an exposure moves from excluded to included, the proposal standard deviation is set by r.jump1; if it stays in the proposed model, by r.jump2. Finally, r.muprop sets the mean of the proposal when a variable moves from excluded to included.

In addition to the tuning parameters, the prior distributions can be adjusted. The prior on \(r\) is specified by r.prior, with options including "gamma", "invunif", and "unif". The mean and standard deviation of the gamma prior on \(\lambda\) are set by mu.lambda and sigma.lambda, and the prior on \(\sigma^2\) is adjusted through a.sigsq and b.sigsq. When varsel = TRUE, the shape parameters of the beta prior on the inclusion probability \(\pi\) can be set through a.p0 and b.p0.
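As a hedged sketch of how these settings are passed (argument names as described above; the specific values here are arbitrary, for illustration only, and reuse y, Z, and X from the earlier fit):

```r
set.seed(1)
fitkm_tuned <- kmbayes(y = y, Z = Z, X = X, iter = 5000, varsel = TRUE,
                       control.params = list(
                         lambda.jump = 10,  # proposal sd for lambda
                         r.jump1 = 0.1,     # proposal sd when a variable enters the model
                         r.jump2 = 0.2,     # proposal sd when a variable stays in the model
                         a.p0 = 1, b.p0 = 1 # beta(1, 1) prior on inclusion probability
                       ))
```

Adjusting the jump sizes mainly affects the Metropolis-Hastings acceptance rates, which can be inspected with summary(fitkm_tuned) to check that the sampler is mixing well.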

Additional options can be found on the package’s GitHub page.
