Bayesian Weighted Sums (BWS)(Colicino et al. 2020; Hamra et al. 2021) is a Bayesian version of Weighted Quantile Sums Regression, which reweights quantiles of predictors and regresses the sum of weighted predictors on the outcome. Unlike WQS/qgcomp which is based on frequentist methods, BWS is a Bayesian method which integrates insights from data to update prior knowledge, and make inference based on the posterior distribution.
Bayesian Weighted Sums shares many similarities with Weighted Quantile Sum (WQS) and quantile g-computation (qgcomp). Similar to weighted quantile sum regression, BWS helps summarize the impact of highly correlated predictors and evaluate the importance of each predictor. In addition, Bayesian methods provide a more natural way to address the restriction (weights sum up to 1) in the model fitting, and is flexible to incorporate our prior knowledge, and also importantly provides an uncertainty quantification for weights, which are estimated but then treated as known in frequentist approaches.
In this tutorial, we are interested in the same research question: predicting log BMI using phthalate measurements from urine samples for individuals above 18 years of age.
In this tutorial, we will use the BWQS package, which can be installed from GitHub running the following lines in the Console.
# Additional installations may be necessary for installing bwqs, which can be loaded using the additional code below:
#install.packages("clusterGeneration")
# install BWQS package
#devtools::install_github("ElenaColicino/bwqs", build_vignettes = TRUE)
# loading packages
library(BWQS)
library(tidyverse)
Similar to previous tutorials, we will use datum from NHANES and look at participants who are over 18 years old and will take the natural log of BMI. Phthalates will be analyzed similar to the greater.
load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
## SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int>
## 1 83732 134671. 135630. 2 1 62 4.39 3 5
## 2 83733 24329. 25282. 2 1 53 1.32 3 3
## 3 83734 12400. 12576. 2 1 78 1.51 3 3
## 4 83735 102718. 102079. 2 2 56 5 3 5
## 5 83736 17628. 18235. 2 2 42 1.23 4 4
## 6 83737 11252. 10879. 2 2 72 2.82 1 2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## # BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## # URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## # URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## # URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## # URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## # URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
test <- nhanes1518|>
dplyr::select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
Below, we remove the participants under the age of eighteen and take the log of BMI.
nhanes <- nhanes1518 |>
filter(RIDAGEYR >= 18)|>
mutate(BMXBMIlog = log(BMXBMI),
RIDAGEYR = RIDAGEYR - 18)|>
dplyr::select(BMXBMIlog, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
# We specify the select function comes from package `dplyr` by using `dplyr::select`
nhanes <- drop_na(nhanes)
Bayes Weighted Sums follows a similar formula to WQS. Each independent variable is assigned a weight and multiplied by the \(\beta_1\) coefficient. The formula is displayed below:
\[ g(E(Y))= \beta_0 + \beta_1 \times (w_1X_1 + w_2X_2 +...+w_kX_k) + \beta_2Z\]
In this model, \(X\) are quantiles of chemical mixtures, \(Z\) are other covariates. \(w\) are weights ranging from 0 to 1, and sum up to 1, presenting the importance of each chemical. \(\beta_0\) is intercept, \(\beta_1\) and \(\beta_2\) are coefficients. \(\beta_1\) is the coefficient of the weighted quantile sum, measuring the combined effect of the mixture exposures.
In a Bayesian regression, we assign prior to unknown parameters. The prior specification is based on our prior knowledge. For this example, we do not have valuable prior information, so we use a relatively less informative prior, the default prior in the package \(N(0,100)\) for each coefficient. (Note: the informativeness of the prior will depend on the scale of the predictor and outcome of interest. The N(0,100) prior places 95% of the prior mass on regression coefficient values between -19.6 and 19.6 (within 2 SD of the mean).)
Unlike \(\beta\), the specification for weights \(w\) is different since it poses a constraint that should sum up to 1, \(w_1 + w_2 + … + w_k = 1\). Since the posterior has the same support as that of the prior, such constraint on weights is easily addressed in Bayesian methods by specifying a prior on weights subject to the constraint. The prior distribution commonly used for weights is the Dirichlet distribution, denoted as \(w \sim Dir(\alpha)\). \(w\) represents weights with length \(k\), and \(\alpha\) represents concentration parameters with length \(k\).
The Dirichlet distribution is generalized from the beta distribution and is a continuous and multivariate probability distribution shown below (from wikipedia). Notice \(B(\alpha)\) is the (multivariate) beta distribution. The concentration parameters \(\alpha_1, \dots, \alpha_k\) of the distribution controls the distribution of probability, where \(k\) is the number of categories (number of chemicals in our case).
As shown above in the image, when the \(\alpha\) are equal, the same probability is placed on all categories. However, if \(\alpha\) are unbalanced, having different, suggest a tendency that some categories are more important. We can assign unequal \(\alpha\), if we have some prior knowledge showing some of predictors have more potential impact than the others.
The amplitude of \(\alpha\)
indicates the degree of confidence we have on prior knowledge. If we
strongly believe they have same importance, we can assign a large value
for \(\alpha\) and set them to be
equal. If we have no prior knowledge or very weak information, we can
set the \(\alpha\) to be same but with
\(\alpha = 1\), which is also the
default prior in BWQS
package.
It is important to note, the assumptions for the Dirichlet distribution include all weights must be positive, real numbers and must sum to 1. This indicates that BWQS require chemical mixtures are coded such that they all contribute in the same direction (unidirection assumption). But this assumption can be relaxed. For more details on avoiding the constraint that all chemicals have effects in similar directions, please see the paper by Colicino et. al., 2020.
BWQS
has similar arguments to the gWQS
and
qgcomp
packages:
q
- specifies the number of quantiles to be used
iter
- the number of iterations in the model (default
10,000)
chains
- number of chains in the Monte Carlo algorithm
(default 1)
thin
- the thinning parameter in the Monte Carlo
algorithm (integer)
prior
- the direction of the prior distribution (“None”
default, “Positive”, “Negative)
It is important to note that the prior of the \(\beta_1\) coefficient can be manually set to either the positive or negative direction. This can be useful if we have specific information regarding the prior and is accomplished by writing prior = “positive” or prior = “negative”. For example, if we feel strongly that greater phthalate exposures will be related to higher weights, we can specify “positive”; if we believe greater phthalate exposures will be related to lower weights, we can specify “negative.” If we are unsure, we can specify “None” (the default specification).
chem_names_new <- c('URXMHBP', 'URXMOH', 'URXMHP', 'URXMHH', 'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
fit_bwqs = bwqs(BMXBMIlog~RIDAGEYR, mix_name = chem_names_new,
data = nhanes, q = 4, family = "gaussian",iter = 5000)
In addition to the model, the BWQS package has built-in visualization tools. Similar to WQS and gWQS, we can visualize the weights assigned to each of the predictors in our model.
bwqs_plot(fit_bwqs, parms = "W", size = 2)
# set parms = "W" to only visualize credible intervals of weights.
Additionally, similar to standard Bayesian regression models, we can us a 95% credible interval to view the results. We want the credible interval corresponding to our \(\beta_1\) value, as shown below. Note the dots are the mean, and the credible interval is displayed by a line.
fit_bwqs$summary_fit
## mean se_mean sd 2.5% 97.5% n_eff Rhat
## beta0 3.27679 0.00037 0.01001 3.25747 3.29595 733.7145 0.99923
## beta1 0.04000 0.00028 0.00775 0.02572 0.05601 779.1893 0.99994
## C_RIDAGEYR 0.00130 0.00001 0.00022 0.00086 0.00171 953.6246 0.99929
## W_URXMHBP 0.03034 0.00101 0.02968 0.00063 0.11086 863.6493 0.99983
## W_URXMOH 0.08063 0.00284 0.07373 0.00319 0.27772 674.0113 0.99894
## W_URXMHP 0.02346 0.00076 0.02264 0.00072 0.08609 875.8088 1.00068
## W_URXMHH 0.07559 0.00273 0.06740 0.00184 0.24973 607.7227 0.99920
## W_URXMCOH 0.19055 0.00430 0.12659 0.00975 0.46715 865.3739 1.00035
## W_URXMHNC 0.11894 0.00294 0.08317 0.00599 0.31224 801.2860 1.00467
## W_URXECP 0.32695 0.00540 0.12936 0.07669 0.60382 573.9067 0.99884
## W_URXHIBP 0.03284 0.00117 0.03203 0.00114 0.11662 748.8895 1.00080
## W_URXMIB 0.12070 0.00290 0.08009 0.00656 0.30102 765.0473 0.99901
## sigma 0.23129 0.00010 0.00274 0.22595 0.23655 777.5395 1.00025
We can say 95% of the posterior distribution for the mean BMI for an 18 year old exposed in the lowest quantile to all mixture elements is between 25.984 and 27.003 (calculate by \(e^{\beta_0}\)).
There is a 95% probability that the coefficient of WQS is between 0.026 and 0.056. To be more specific, holding other variables constant, for every one unit increase in WQS (for every one increase in quantile of each chemical mixture), a 2.605% to 5.761% increase in BMI with 95% probability. (It is calculated by \(e^\beta_1-1\) since we use logBMI as response.)
Holding other variables constant, for each additional year of age, 95% of the posterior distribution for the percentage change in BMI is between 0.086% and 0.171%.
It is important to note that if the credible interval includes 0, the relationship is not considered significant. In this example, 0 is contained within the interval, so we there is not sufficient evidence to conclude there is a relationship between the phthalates and log BMI. Moreover, we can calculate the posterior probability that higher phthalate levels are related to increased BMI (\(Pr(\beta_1>0|Y)\)) is 100%.
## beta1 represent the increase of logBMI with 1 unit increase of WQS holding other constant.
## beta1>0 indicates higher phthalate levels are related to increased logBMI/BMI.
## Calculate Pr(beta1>0|Y)
## Step 1: extract the posterior samples for beta_1. (use extract function from rstan package)
beta1_post = rstan::extract(fit_bwqs$fit)$beta1
## Step 2: calculate the proportion of posterior samples greater than 0.
mean(beta1_post>0)
## [1] 1
Additional information provided by the model includes the standard error, the effective sample size (size of independent samples), and R-hat which is the corresponds to the convergence of MCMC simulations. An effective sample size (\(\hat{R}\)) is typically considered reasonable if it is less that 1.01. (Vehtari et al. 2021)
When predicting using Bayesian methods, we usually obtain the
prediction interval based on posterior samples. Here we provide an
example for calculating a prediction interval manually. Notice the
BWQS
package is based on the rstan
package,
which does not have a predict()
function. We need to
manually construct our own predict()
function based on our
model.
# obtain the average chemical mixture level of the dataset.
# you can specify your own data to predict
new_data = nhanes |>
apply(2,mean) |>
t() |>
as.data.frame() |>
select( - BMXBMIlog)
new_data$RIDAGEYR = 0
# transfer the new data to the quantiles defined by the quantile of original data
q = 4 # the quantile we use
for (i in 1:length(chem_names_new)){
dat_num = as.numeric(unlist(nhanes[, chem_names_new[i]]))
bound = unique(quantile(dat_num, probs = seq(0, 1, by = 1/q), na.rm = TRUE))
new_data[[chem_names_new[i]]] = cut(new_data[[chem_names_new[i]]],breaks = bound,labels = FALSE,include.lowest = TRUE)-1
}
# new data presented in quantiles
new_data = new_data |>
as.matrix()
# Plug the data to the model
## our model is : Xb = beta0 + beta1*(X*W) + KV*delta
### posterior samples of parameters
param.sample <- as.data.frame(rstan::extract(fit_bwqs$fit)) |>
select(beta0:W.9) |>
as.matrix()
result.sample<- apply(param.sample, 1, function(x)
x["beta0"] + x["beta1"]*sum(new_data[-10] * x[4:12]) + x["delta"]*new_data[10])
## posterior mean and sd
mean(result.sample)
## [1] 3.34642
quantile (result.sample,c(0.025,0.975))
## 2.5% 97.5%
## 3.326794 3.363676
For a population of 18 years old and with fairly typical exposure values, 95% of the posterior distribution for the mean BMI is between 27.849 and 28.895.
For the model evaluation, instead of using AIC for our model evaluation, we use WAIC, which is widely used with Bayesian regression models (Watanabe 2013). We will create a second model predicting log BMI, excluding three chemicals whose posterior mean weights are close to zero.
chem_names_new_1 <- c('URXMOH','URXMHH','URXMCOH', 'URXMHNC', 'URXECP', 'URXMIB')
nhanes_1 = nhanes %>% select(-c('URXMHBP','URXMHP' , 'URXHIBP'))
fit_bwqs_less = bwqs(BMXBMIlog~RIDAGEYR, mix_name = chem_names_new_1,
data = nhanes_1, q = 4, family = "gaussian",iter = 5000)
The WAIC metric from the original model:
bwqs_waic(fit_bwqs$fit)$waic
## waic
## -312.644
The WAIC metric from the model with fewer chemicals:
bwqs_waic(fit_bwqs_less$fit)$waic
## waic
## -317.6794
A smaller WAIC value is better. Therefore, the model with fewer chemicals is better.
In all, the BWQS model has many similarities to WQS but offers distinct advantages. In WQS regression, weights are provided through point estimates, and uncertainty is typically addressed using bootstrapping methods. In contrast, BWQS provides uncertainty quantification by constructing credible intervals based on posterior samples. Additionally, the Bayesian methods introduces additional flexibility, enabling the construction of more complex models, such as hierarchical Bayesian WQS. Please see this vignette for more details.