Bayesian Weighted Sums (BWS) is a Bayesian version of Weighted Quantile Sums Regression, which reweights quantiles of predictors and regresses the sum of weighted predictors on the outcome. Unlike WQS/qgcomp which is based on frequentist methods, BWS is a Bayesian method which integrates insights from data to update prior knowledge, and make inference based on posterior.
Bayesian Weighted Sums shares many similarities with Weighted Quantile Sums (WQS) and quantile g-computation (qgcomp). Similar to weighted quantile sums regression, BWS helps summarize the impact of highly correlated predictors and evaluate the importance of each predictor. In addition, Bayesian methods provide a more natural way to address the restriction (weights sum up to 1) in the model fitting, and is flexible to incorporate our prior knowledge, and also provides an uncertainty quantification for weights.
In this tutorial, we are interested in the same research question: predicting log BMI using phalate measurements from urine samples for individuals above 18 years of age.
In this tutorial, we will use the BWQS package, which can be installed from GitHub running the following lines in the Console.
# Additional installations may be necessary for installing bwqs, which can be loaded using the additional code below:
#install.packages("clusterGeneration")
# install BWQS package
#devtools::install_github("ElenaColicino/bwqs", build_vignettes = TRUE)
# loading packages
library(BWQS)
library(tidyverse)
Similar to previous tutorials, we will use datum from NHANES and look at participants who are over 18 years old and will take the natural log of BMI. Phalates will be analyzed similar to the greater.
load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
## SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int>
## 1 83732 134671. 135630. 2 1 62 4.39 3 5
## 2 83733 24329. 25282. 2 1 53 1.32 3 3
## 3 83734 12400. 12576. 2 1 78 1.51 3 3
## 4 83735 102718. 102079. 2 2 56 5 3 5
## 5 83736 17628. 18235. 2 2 42 1.23 4 4
## 6 83737 11252. 10879. 2 2 72 2.82 1 2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## # BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## # URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## # URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## # URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## # URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## # URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
test <- nhanes1518|>
dplyr::select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
Below, we remove the participants under the age of eighteen and take the log of BMI.
nhanes <- nhanes1518 |>
filter(RIDAGEYR >= 18)|>
mutate(BMXBMIlog = log(BMXBMI),
RIDAGEYR = RIDAGEYR - 18)|>
dplyr::select(BMXBMI, BMXBMIlog, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
# We specify the select function comes from package `dplyr` by using `dplyr::select`
nhanes <- drop_na(nhanes)
nhanes
## # A tibble: 3,523 × 12
## BMXBMI BMXBMIlog URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.3 3.01 5.3 7.8 2 8.7 0.7 1.2 14.1 27.4
## 2 24.1 3.18 0.5 8.6 3.1 11.9 0.35 0.28 16.7 1.5
## 3 43.7 3.78 1.1 8.4 0.57 16.4 0.6 0.28 25.8 6.3
## 4 38 3.64 1.5 7.8 1.2 14.4 0.6 1.1 19.7 4.7
## 5 26.3 3.27 0.5 1.7 0.57 2.7 0.6 0.9 3.9 2
## 6 25 3.22 1.9 15.4 8.2 27.6 0.35 0.5 26.1 6.1
## 7 22.8 3.13 0.28 6 5.1 9.5 14.2 72.2 14.1 3.7
## 8 34 3.53 1.8 1.9 2.1 2.7 0.35 0.4 4 4.9
## 9 31 3.43 1.5 5 2.6 6.2 0.35 0.28 9.4 1.4
## 10 34 3.53 0.28 2.6 0.57 5.1 0.9 0.8 5.9 3.2
## # ℹ 3,513 more rows
## # ℹ 2 more variables: URXMIB <dbl>, RIDAGEYR <dbl>
Bayes Weighted Sums follows a similar formula to WQS. Each independent variable is assigned a weight and multiplied by the beta 1 coefficient. The formula is displayed below:
\[ Eg(Y)= \beta_0 + \beta_1 \times (w_1X_1 + w_2X_2 +...+w_kX_k) + \beta_2Z\]
In this model, \(X\) are quantiles of chemical mixtures, \(Z\) are other covariates. \(w\) are weights ranging from 0 to 1, and sum up to 1, presenting the importance of each chemical. \(\beta_0\) is intercept, \(\beta_1\) and \(\beta_2\) are coefficients. \(\beta_1\) is the coefficient of weighted quantile sum, measuring the combined effect of the mixture exposures.
In a Bayesian regression, we assign prior to unknown parameters. The prior specification is based on our prior knowledge. For this example, we do not have valuable prior information, so we use a relative less informative prior, the default prior in the package \(N(0,100)\) for each coefficient.
Unlike \(\beta\), the specification for weights \(w\) different since it poses a constraint that should sum up to 1, \(w_1 + w_2 + … + w_k = 1\). Since the posterior has the same support as that of the prior, such constraint on weights is easily addressed in Bayesian methods by specifying a prior on weights subject to the constraint. The prior commonly used for weights is Dirichlet distribution, denoted as \(w \sim Dir(\alpha)\). \(w\) represents weights with length \(k\), \(alpha\) represents concentration parameters with length \(k\).
The Dirichlet distribution is generalized from the beta distribution and is a continuous and multivariate probability distribution shown below. Notice \(B(\alpha)\) is the (multivariate) beta distribution. The concentration parameters \(\alpha_1, \dots, \alpha_k\) of the distribution controls the distribution of probability, where \(k\) is the number of categories (number of chemicals in our case).
As shown above in the image, when \(\alpha\) are equal, indicating same importance for all categories. However, if \(\alpha\) are unbalanced, having different, suggest a tendency that some categories are more important. We can assign unequal \(alpha\), if we have some prior knowledge showing some of predictors have more potential impact than the others.
The amplitude of \(\alpha\)
indicates the degree of confidence we have on prior knowledge. If we
strongly believe they have same importance, we can assign a large value
for \(\alpha\) and set them to be
equal. If we have no prior knowledge or very weak information, we can
set the \(\alpha\) to be same but with
\(\alpha = 1\), which is also the
default prior in BWQS
package.
It is important to note, the assumptions for the Dirichlet distribution include all weights must be positive, real numbers and must sum to 1. This indicates that BWQS require chemical mixtures contribute to the same direction. (unidirection assumption) But such assumption can be relaxed. With the flexibility of Bayesian methods, we can group the mixtures and model using different parameters separately. Please see the paper by Colicino et. al., 2020.
BWQS
has similar arguments to the gWQS
and
qgcomp
packages:
q
- specifies the number of quantiles to be used
iter
- the number of iterations in the model (default
10,000)
chains
- number of chains in the Monte Carlo algorithm
(default 1)
thin
- the thinning parameter in the Monte Carlo
algorithm (integer)
prior
- the direction of the prior distribution (“None”
default, “Positive”, “Negative)
It is important to note that the prior of the \(\beta_1\) coefficient can be manually set to either the positive or negative direction. This can be useful if we have specific information regarding the prior and is accomplished by writing prior = “positive” or prior = “negative”.
chem_names_new <- c('URXMHBP', 'URXMOH', 'URXMHP', 'URXMHH', 'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
# here we only select 100 individuals as an example
set.seed(1)
nhanes_s = sample_n(tbl = nhanes,size = 100,replace = FALSE)
fit_bwqs = bwqs(BMXBMIlog~RIDAGEYR, mix_name = chem_names_new,
data = nhanes_s, q = 4, family = "gaussian",iter = 5000)
In addition to the model, the BWQS package has built-in visualization tools. Similar to WQS and gWQS, we can visualize the weights assigned to each of the predictors in our model.
bwqs_plot(fit_bwqs, parms = "W", size = 2)
# set parms = "W" to only visualize credible intervals of weights.
Additionally, similar to standard Bayesian regression models, we can us a 95% credible interval to view the results. We want the credible interval corresponding to our beta1 value, as shown below. Note the small dots correspond to the median, the large dots are the mean, and the credible interval is displayed by a line.
fit_bwqs$summary_fit
## mean se_mean sd 2.5% 97.5% n_eff Rhat
## beta0 3.43050 0.00220 0.05781 3.31390 3.54556 688.3143 0.99908
## beta1 -0.06432 0.00125 0.03429 -0.12880 0.00000 754.1506 0.99936
## C_RIDAGEYR 0.00041 0.00005 0.00132 -0.00238 0.00293 835.5107 1.00149
## W_URXMHBP 0.13252 0.00379 0.10803 0.00537 0.40491 813.3773 0.99881
## W_URXMOH 0.09411 0.00312 0.09074 0.00200 0.33795 847.4978 1.00004
## W_URXMHP 0.12279 0.00351 0.09869 0.00332 0.35885 790.1065 0.99897
## W_URXMHH 0.08759 0.00332 0.08144 0.00323 0.31155 601.9338 0.99977
## W_URXMCOH 0.10346 0.00319 0.09037 0.00322 0.33468 800.1721 1.00095
## W_URXMHNC 0.08763 0.00286 0.08334 0.00231 0.31197 850.6066 0.99885
## W_URXECP 0.08056 0.00276 0.07514 0.00228 0.25975 742.7438 0.99884
## W_URXHIBP 0.14838 0.00410 0.11652 0.00490 0.42527 809.6495 0.99977
## W_URXMIB 0.14296 0.00417 0.11725 0.00518 0.42406 789.4354 1.00161
## sigma 0.22797 0.00060 0.01646 0.20070 0.26559 742.2037 1.00336
We are 95% confident that the average BMI for an individual at the age of 18 with low level of all chemical mixtures lies between 27.492 and 34.659.
Holding other variables constant, for every one unit increase in WQS (for every one increase in quantile of each chemical mixture), we are 95% confidence to see BMI increase by a percentage between -12.085% and 0%. (It is calculated by exp(beta1-1) since we use logBMI as response.)
Holding other variables constant, for each additional year of age, we are 95% confidence to see BMI increase by a percentage between -0.238% and 0.293%.
It is important to note that if the credible interval includes 0, the relationship is not considered significant. In this example, 0 is contained within the interval, so we there is not sufficient evidence to conclude there is a relationship between the phalates and log BMI.
Additional information provided by the model includes the standard error, the effective sample size (size of independent samples), and R-hat which is the corresponds to the convergence of MCMC simulations. A reasonable effective sample size (R-hat) is typically considered reasonable if it is less that 1.01.
When predicting using Bayesian methods, we usually obtain prediction
interval based on posterior samples. Here we provide an example for
calculating prediction interval manually. Notice the BWQS
package is based on rstan
package, which does not have
predict()
function. We need to manually construct our own
predict()
function based on our model.
# obtain the average chemical mixture level of the dataset.
# you can specify your own data to predict
new_data = nhanes_s %>%
apply(.,2,mean) %>%
t() %>%
as.data.frame() %>%
select(-BMXBMI, - BMXBMIlog)
new_data$RIDAGEYR = 0
# transfer the new data to the quantiles defined by the quantile of original data
q = 4 # the quantile we use
for (i in 1:length(chem_names_new)){
dat_num = as.numeric(unlist(nhanes_s[, chem_names_new[i]]))
bound = unique(quantile(dat_num, probs = seq(0, 1, by = 1/q), na.rm = TRUE))
new_data[[chem_names_new[i]]] = cut(new_data[[chem_names_new[i]]],breaks = bound,labels = FALSE,include.lowest = TRUE)-1
}
# new data presented in quantiles
new_data = new_data %>% as.matrix()
# Plug the data to the model
## our model is : Xb = beta0 + beta1*(X*W) + KV*delta
### posterior samples of parameters
param.sample <- as.data.frame(rstan::extract(fit_bwqs$fit)) %>%
select(beta0:W.9) %>% as.matrix()
result.sample<- apply(param.sample, 1, function(x)
x["beta0"] + x["beta1"]*sum(new_data[-10] * x[4:12]) + x["delta"]*new_data[10])
## posterior mean and sd
mean(result.sample)
## [1] 3.303706
quantile (result.sample,c(0.025,0.975))
## 2.5% 97.5%
## 3.198521 3.411804
For a population of 18 years old and with fairly typical exposure values , we are 95% confident that the BMI will between 24.496 and 30.32.
For the model evaluation, instead of using AIC for our model evaluation, we use WAIC, which is widely used with Bayesian regression models. We will create a second model predicting log BMI but only use half of chemical mixtures (remove the 3 chemicals).
chem_names_newage <- c('URXMHH','URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB','RIDAGEYR')
nhanes_s1 = nhanes_s %>% select(-c('URXMHBP', 'URXMOH', 'URXMHP'))
fit_bwqs_less = bwqs(BMXBMIlog~RIDAGEYR, mix_name = chem_names_newage,
data = nhanes_s1, q = 4, family = "gaussian",iter = 5000)
The WAIC metric from the original model:
bwqs_waic(fit_bwqs$fit)$waic
## waic
## -7.495507
The WAIC metric from the model with less chemical mixtures:
bwqs_waic(fit_bwqs_less$fit)$waic
## waic
## -7.239514
A smaller WAIC value is better. Therefore, the model with more chemical mixtures is better.
In all, the BWQS model has many similarities to WQS and qgcomp. However, the Bayesian methods can be utilized allowing additional flexibility in the model which I do not address here. Please see this vignette for more details on HIerarchical Bayesian Quantile Weighted Sums.