Bayesian Weighted Sums (BWS) is a Bayesian version of Weighted Quantile Sums Regression, which reweights quantiles of predictors and regresses the sum of weighted predictors on the outcome. Unlike WQS/qgcomp which is based on frequentist methods, BWS is a Bayesian method which integrates insights from data to update prior knowledge, and make inference based on posterior.

Bayesian Weighted Sums shares many similarities with Weighted Quantile Sums (WQS) and quantile g-computation (qgcomp). Similar to weighted quantile sums regression, BWS helps summarize the impact of highly correlated predictors and evaluate the importance of each predictor. In addition, Bayesian methods provide a more natural way to address the restriction (weights sum up to 1) in the model fitting, and is flexible to incorporate our prior knowledge, and also provides an uncertainty quantification for weights.

In this tutorial, we are interested in the same research question: predicting log BMI using phalate measurements from urine samples for individuals above 18 years of age.

Preparation

In this tutorial, we will use the BWQS package, which can be installed from GitHub running the following lines in the Console.

# Additional installations may be necessary for installing bwqs, which can be loaded using the additional code below:
#install.packages("clusterGeneration")

# install BWQS package
#devtools::install_github("ElenaColicino/bwqs", build_vignettes = TRUE)

# loading packages
library(BWQS)
library(tidyverse)

Similar to previous tutorials, we will use datum from NHANES and look at participants who are over 18 years old and will take the natural log of BMI. Phalates will be analyzed similar to the greater.

load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
##    SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
##   <int>    <dbl>    <dbl>    <int>    <int>    <int>    <dbl>    <int>    <int>
## 1 83732  134671.  135630.        2        1       62     4.39        3        5
## 2 83733   24329.   25282.        2        1       53     1.32        3        3
## 3 83734   12400.   12576.        2        1       78     1.51        3        3
## 4 83735  102718.  102079.        2        2       56     5           3        5
## 5 83736   17628.   18235.        2        2       42     1.23        4        4
## 6 83737   11252.   10879.        2        2       72     2.82        1        2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## #   BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## #   URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## #   URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## #   URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## #   URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## #   URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
test <- nhanes1518|>
  dplyr::select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)

Below, we remove the participants under the age of eighteen and take the log of BMI.

nhanes <- nhanes1518 |>
    filter(RIDAGEYR >= 18)|>
    mutate(BMXBMIlog = log(BMXBMI),
          RIDAGEYR = RIDAGEYR - 18)|>
    dplyr::select(BMXBMI, BMXBMIlog, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
# We specify the select function comes from package `dplyr` by using `dplyr::select`
nhanes <- drop_na(nhanes)

nhanes
## # A tibble: 3,523 × 12
##    BMXBMI BMXBMIlog URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP
##     <dbl>     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1   20.3      3.01    5.3     7.8   2       8.7    0.7     1.2    14.1    27.4
##  2   24.1      3.18    0.5     8.6   3.1    11.9    0.35    0.28   16.7     1.5
##  3   43.7      3.78    1.1     8.4   0.57   16.4    0.6     0.28   25.8     6.3
##  4   38        3.64    1.5     7.8   1.2    14.4    0.6     1.1    19.7     4.7
##  5   26.3      3.27    0.5     1.7   0.57    2.7    0.6     0.9     3.9     2  
##  6   25        3.22    1.9    15.4   8.2    27.6    0.35    0.5    26.1     6.1
##  7   22.8      3.13    0.28    6     5.1     9.5   14.2    72.2    14.1     3.7
##  8   34        3.53    1.8     1.9   2.1     2.7    0.35    0.4     4       4.9
##  9   31        3.43    1.5     5     2.6     6.2    0.35    0.28    9.4     1.4
## 10   34        3.53    0.28    2.6   0.57    5.1    0.9     0.8     5.9     3.2
## # ℹ 3,513 more rows
## # ℹ 2 more variables: URXMIB <dbl>, RIDAGEYR <dbl>

Bayesian Weighted Sums

Bayes Weighted Sums follows a similar formula to WQS. Each independent variable is assigned a weight and multiplied by the beta 1 coefficient. The formula is displayed below:

\[ Eg(Y)= \beta_0 + \beta_1 \times (w_1X_1 + w_2X_2 +...+w_kX_k) + \beta_2Z\]

In this model, \(X\) are quantiles of chemical mixtures, \(Z\) are other covariates. \(w\) are weights ranging from 0 to 1, and sum up to 1, presenting the importance of each chemical. \(\beta_0\) is intercept, \(\beta_1\) and \(\beta_2\) are coefficients. \(\beta_1\) is the coefficient of weighted quantile sum, measuring the combined effect of the mixture exposures.

In a Bayesian regression, we assign prior to unknown parameters. The prior specification is based on our prior knowledge. For this example, we do not have valuable prior information, so we use a relative less informative prior, the default prior in the package \(N(0,100)\) for each coefficient.

Unlike \(\beta\), the specification for weights \(w\) different since it poses a constraint that should sum up to 1, \(w_1 + w_2 + … + w_k = 1\). Since the posterior has the same support as that of the prior, such constraint on weights is easily addressed in Bayesian methods by specifying a prior on weights subject to the constraint. The prior commonly used for weights is Dirichlet distribution, denoted as \(w \sim Dir(\alpha)\). \(w\) represents weights with length \(k\), \(alpha\) represents concentration parameters with length \(k\).

The Dirichlet distribution is generalized from the beta distribution and is a continuous and multivariate probability distribution shown below. Notice \(B(\alpha)\) is the (multivariate) beta distribution. The concentration parameters \(\alpha_1, \dots, \alpha_k\) of the distribution controls the distribution of probability, where \(k\) is the number of categories (number of chemicals in our case).

As shown above in the image, when \(\alpha\) are equal, indicating same importance for all categories. However, if \(\alpha\) are unbalanced, having different, suggest a tendency that some categories are more important. We can assign unequal \(alpha\), if we have some prior knowledge showing some of predictors have more potential impact than the others.

The amplitude of \(\alpha\) indicates the degree of confidence we have on prior knowledge. If we strongly believe they have same importance, we can assign a large value for \(\alpha\) and set them to be equal. If we have no prior knowledge or very weak information, we can set the \(\alpha\) to be same but with \(\alpha = 1\), which is also the default prior in BWQS package.

It is important to note, the assumptions for the Dirichlet distribution include all weights must be positive, real numbers and must sum to 1. This indicates that BWQS require chemical mixtures contribute to the same direction. (unidirection assumption) But such assumption can be relaxed. With the flexibility of Bayesian methods, we can group the mixtures and model using different parameters separately. Please see the paper by Colicino et. al., 2020.

Model Fitting

BWQS has similar arguments to the gWQS and qgcomp packages:

q - specifies the number of quantiles to be used

iter - the number of iterations in the model (default 10,000)

chains - number of chains in the Monte Carlo algorithm (default 1)

thin - the thinning parameter in the Monte Carlo algorithm (integer)

prior - the direction of the prior distribution (“None” default, “Positive”, “Negative)

It is important to note that the prior of the \(\beta_1\) coefficient can be manually set to either the positive or negative direction. This can be useful if we have specific information regarding the prior and is accomplished by writing prior = “positive” or prior = “negative”.

chem_names_new <- c('URXMHBP', 'URXMOH',  'URXMHP',  'URXMHH',  'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
# here we only select 100 individuals as an example
set.seed(1)
nhanes_s = sample_n(tbl = nhanes,size = 100,replace = FALSE)

fit_bwqs = bwqs(BMXBMIlog~RIDAGEYR, mix_name = chem_names_new,
                data = nhanes_s, q = 4, family = "gaussian",iter = 5000)

Model Interpretation and Inference

In addition to the model, the BWQS package has built-in visualization tools. Similar to WQS and gWQS, we can visualize the weights assigned to each of the predictors in our model.

bwqs_plot(fit_bwqs, parms = "W", size = 2)

# set parms = "W" to only visualize credible intervals of weights.

Additionally, similar to standard Bayesian regression models, we can us a 95% credible interval to view the results. We want the credible interval corresponding to our beta1 value, as shown below. Note the small dots correspond to the median, the large dots are the mean, and the credible interval is displayed by a line.

fit_bwqs$summary_fit 
##                mean se_mean      sd     2.5%   97.5%    n_eff    Rhat
## beta0       3.43050 0.00220 0.05781  3.31390 3.54556 688.3143 0.99908
## beta1      -0.06432 0.00125 0.03429 -0.12880 0.00000 754.1506 0.99936
## C_RIDAGEYR  0.00041 0.00005 0.00132 -0.00238 0.00293 835.5107 1.00149
## W_URXMHBP   0.13252 0.00379 0.10803  0.00537 0.40491 813.3773 0.99881
## W_URXMOH    0.09411 0.00312 0.09074  0.00200 0.33795 847.4978 1.00004
## W_URXMHP    0.12279 0.00351 0.09869  0.00332 0.35885 790.1065 0.99897
## W_URXMHH    0.08759 0.00332 0.08144  0.00323 0.31155 601.9338 0.99977
## W_URXMCOH   0.10346 0.00319 0.09037  0.00322 0.33468 800.1721 1.00095
## W_URXMHNC   0.08763 0.00286 0.08334  0.00231 0.31197 850.6066 0.99885
## W_URXECP    0.08056 0.00276 0.07514  0.00228 0.25975 742.7438 0.99884
## W_URXHIBP   0.14838 0.00410 0.11652  0.00490 0.42527 809.6495 0.99977
## W_URXMIB    0.14296 0.00417 0.11725  0.00518 0.42406 789.4354 1.00161
## sigma       0.22797 0.00060 0.01646  0.20070 0.26559 742.2037 1.00336

We are 95% confident that the average BMI for an individual at the age of 18 with low level of all chemical mixtures lies between 27.492 and 34.659.

Holding other variables constant, for every one unit increase in WQS (for every one increase in quantile of each chemical mixture), we are 95% confidence to see BMI increase by a percentage between -12.085% and 0%. (It is calculated by exp(beta1-1) since we use logBMI as response.)

Holding other variables constant, for each additional year of age, we are 95% confidence to see BMI increase by a percentage between -0.238% and 0.293%.

It is important to note that if the credible interval includes 0, the relationship is not considered significant. In this example, 0 is contained within the interval, so we there is not sufficient evidence to conclude there is a relationship between the phalates and log BMI.

Additional information provided by the model includes the standard error, the effective sample size (size of independent samples), and R-hat which is the corresponds to the convergence of MCMC simulations. A reasonable effective sample size (R-hat) is typically considered reasonable if it is less that 1.01.

Model Prediction

When predicting using Bayesian methods, we usually obtain prediction interval based on posterior samples. Here we provide an example for calculating prediction interval manually. Notice the BWQS package is based on rstan package, which does not have predict() function. We need to manually construct our own predict() function based on our model.

# obtain the average chemical mixture level of the dataset.
# you can specify your own data to predict
new_data = nhanes_s %>%
  apply(.,2,mean) %>%
  t() %>%
  as.data.frame() %>%
  select(-BMXBMI, - BMXBMIlog)
new_data$RIDAGEYR = 0

# transfer the new data to the quantiles defined by the quantile of original data
q = 4 # the quantile we use
for (i in 1:length(chem_names_new)){
      dat_num = as.numeric(unlist(nhanes_s[, chem_names_new[i]]))
      bound = unique(quantile(dat_num, probs = seq(0, 1, by = 1/q), na.rm = TRUE)) 
      new_data[[chem_names_new[i]]] = cut(new_data[[chem_names_new[i]]],breaks = bound,labels = FALSE,include.lowest = TRUE)-1
}
# new data presented in quantiles
new_data = new_data %>% as.matrix()

# Plug the data to the model
## our model is : Xb = beta0 + beta1*(X*W) + KV*delta

### posterior samples of parameters
param.sample <- as.data.frame(rstan::extract(fit_bwqs$fit)) %>%
  select(beta0:W.9) %>% as.matrix()

result.sample<- apply(param.sample, 1, function(x) 
  x["beta0"] + x["beta1"]*sum(new_data[-10] * x[4:12]) + x["delta"]*new_data[10])


## posterior mean and sd
mean(result.sample)
## [1] 3.303706
quantile (result.sample,c(0.025,0.975))
##     2.5%    97.5% 
## 3.198521 3.411804

For a population of 18 years old and with fairly typical exposure values , we are 95% confident that the BMI will between 24.496 and 30.32.

Model Evaluation

For the model evaluation, instead of using AIC for our model evaluation, we use WAIC, which is widely used with Bayesian regression models. We will create a second model predicting log BMI but only use half of chemical mixtures (remove the 3 chemicals).

chem_names_newage <- c('URXMHH','URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB','RIDAGEYR')
nhanes_s1 = nhanes_s %>% select(-c('URXMHBP', 'URXMOH',  'URXMHP'))

fit_bwqs_less = bwqs(BMXBMIlog~RIDAGEYR, mix_name = chem_names_newage,
                data = nhanes_s1, q = 4, family = "gaussian",iter = 5000)

The WAIC metric from the original model:

bwqs_waic(fit_bwqs$fit)$waic
##      waic 
## -7.495507

The WAIC metric from the model with less chemical mixtures:

bwqs_waic(fit_bwqs_less$fit)$waic
##      waic 
## -7.239514

A smaller WAIC value is better. Therefore, the model with more chemical mixtures is better.

Discussion

In all, the BWQS model has many similarities to WQS and qgcomp. However, the Bayesian methods can be utilized allowing additional flexibility in the model which I do not address here. Please see this vignette for more details on HIerarchical Bayesian Quantile Weighted Sums.