Assume we are interested in exploring the relationship between the chemical exposures and the outcome (BMI for example). These chemical exposures are usually highly correlated with each other (known as multicollinearity phenomenon in regression). Multicollinearity among the predictors impedes model inference and prediction. The standard errors for our regression coefficients will inflate, and thus we will lose precision in our estimates of the regression coefficients. To eliminate the issue of multicollinearity, we can consider summarizing these highly correlated chemical mixtures when putting them into a regression model. The idea of summary motivates the weighted quantile sum regression.
Unlike ordinary multivariate regression, the Weighted Quantile Sum
Regression (WQS) model uses quantiles of chemical exposures and weighs
them based on their individual impacts on the overall outcome. The
exposure that has a strong impact on the mixture are assigned a greater
weight while others that are not as critical are given a lower weight.
The weighted sum of quantiles of chemical exposures is then treated as a
predictor and regress on the response variable. We utilize the r package
gWQS
to illustrate model fitting, prediction, and
evaluation.
We recommend another package
qgcomp
, which allows us to use a similar method, but has faster computation and allows more flexibility in the assumptions and data that can be used in the model.
In order to investigate relationships between chemical exposures
using the gWQS
package, there are several packages that
must be installed in R. Similar to previous tutorials where packages
were introduced, we will be using tidyverse
and
tidymodels
for data wrangling and model building, but we
will also use ggcorrplot
for data visualization.
In order to use the new packages in this tutorial, they must first be
installed in R by running the following lines of code
install.packages("ggcorrplot")
and
install.packages("gWQS")
in the console.
Additional installations may be required for mac users!
library(gWQS)
library(tidyverse)
library(tidymodels)
library(ggcorrplot)
In this tutorial, we use the new base R pipe operator “|>” which is equivalent to “%>%”
As discussed earlier, we want to apply Weighted Quantile Sums to a
real world issue. This example uses the NHANES
dataset to
focus on specific chemical exposures: Phalates and Phytoestrogens. Both
of these exposures can be detected in urine samples and we want to
estimate their respective impact on the outcome, body mass index. In
order to predict body mass index with urine samples, we first need to
reintroduce the NHANES
dataset for use in this example.
The NHANES dataset can be loaded into R:
load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
## SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int>
## 1 83732 134671. 135630. 2 1 62 4.39 3 5
## 2 83733 24329. 25282. 2 1 53 1.32 3 3
## 3 83734 12400. 12576. 2 1 78 1.51 3 3
## 4 83735 102718. 102079. 2 2 56 5 3 5
## 5 83736 17628. 18235. 2 2 42 1.23 4 4
## 6 83737 11252. 10879. 2 2 72 2.82 1 2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## # BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## # URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## # URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## # URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## # URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## # URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
Once the data is loaded, we can proceed to understand how our chemical exposures (detected in urine samples) impact the body mass index of participants.
Similar to the previous tutorial where age is filtered, we only want
to consider participants in the data that are above 18 years old and
take the natural log of BMI. By filtering and using the natural log, the
distribution of the ages of participants are normalized. To finish
preparing the dataset, the N/A values are removed and the resulting data
is saved under the name nhanes
.
nhanes <- nhanes1518 |>
filter(RIDAGEYR >= 18)|>
mutate(BMXBMI = log(BMXBMI),
RIDAGEYR = RIDAGEYR - 18)|>
select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB)
nhanes <- drop_na(nhanes)
nhanes
## # A tibble: 3,523 × 10
## BMXBMI URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP URXMIB
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.01 5.3 7.8 2 8.7 0.7 1.2 14.1 27.4 73.7
## 2 3.18 0.5 8.6 3.1 11.9 0.35 0.28 16.7 1.5 4.4
## 3 3.78 1.1 8.4 0.57 16.4 0.6 0.28 25.8 6.3 32.7
## 4 3.64 1.5 7.8 1.2 14.4 0.6 1.1 19.7 4.7 8.8
## 5 3.27 0.5 1.7 0.57 2.7 0.6 0.9 3.9 2 7.1
## 6 3.22 1.9 15.4 8.2 27.6 0.35 0.5 26.1 6.1 13.4
## 7 3.13 0.28 6 5.1 9.5 14.2 72.2 14.1 3.7 17
## 8 3.53 1.8 1.9 2.1 2.7 0.35 0.4 4 4.9 14.7
## 9 3.43 1.5 5 2.6 6.2 0.35 0.28 9.4 1.4 3.7
## 10 3.53 0.28 2.6 0.57 5.1 0.9 0.8 5.9 3.2 12.8
## # ℹ 3,513 more rows
To confirm the normality of the variable BMI, we can create a histogram to view the shape of the observations:
nhanes|> ggplot(aes(x = BMXBMI))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In order to use the gWQS package in an application setting, several conditions must be followed. The first condition is that the chemical exposures must ALL contribute in one direction. In order to contribute in one direction, all variables must be either positively correlated, or negatively correlated with each other. In addition to the correlation of the exposures, the dataset must be split into two sections. The first section is the training data for providing weights to the variables, while the second is used exclusively for verifying the model and ensuring that the results are reliable.
In order to use weighted quantile sum regression with the
nhanes
data, it is important to first confirm that all
constraints are satisfied:
An initial correlation model reveals that the URX predictor variables all contribute in a single direction, validating the appropriateness of the gWQS model.
cor_hanes <- cor(nhanes)
ggcorrplot(cor_hanes)+
labs(title = "Corrleation of URX Variables and BMI")
#Source http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2#:~:text=The%20easiest%20way%20to%20visualize,ggcorr()%20in%20ggally%20package
However, if we fit a normal linear regression model to these predictors, positive and negative estimates would be produced. Intuitively, the URX variables should all have positive coefficients, however, this model is not representative of this assumption. This is a demonstration of multi-collinearity, where there are there is a large uncertainty (variance) when calculating the coefficients. Therefore, the output of the model may produce unusual signs for the beta estimates.
As shown below, some of the URX betas are positive, while others are negative. Instead, we want to use a gWQS model which will provide a unidirectional output, similar to a standard linear regression model.
#large uncertanty (variance) around the coefficients and multicolinearity (including that the coeffients may be negatve) - point estimate may be negative (abnormal signs)
linear_reg()|>
fit(BMXBMI~., data = nhanes)|>
tidy()
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.36 0.00509 660. 0
## 2 URXMHBP -0.0000177 0.000839 -0.0211 0.983
## 3 URXMOH 0.00209 0.00174 1.21 0.228
## 4 URXMHP -0.00704 0.00164 -4.29 0.0000182
## 5 URXMHH 0.000405 0.00101 0.401 0.688
## 6 URXMCOH 0.00238 0.00105 2.26 0.0236
## 7 URXMHNC -0.000869 0.000402 -2.16 0.0306
## 8 URXECP -0.0000514 0.000285 -0.180 0.857
## 9 URXHIBP -0.00567 0.00155 -3.67 0.000250
## 10 URXMIB 0.00185 0.000464 3.98 0.0000693
Once the conditions are satisfied, a model can be produced, similarly to standard linear regression model. One difference from the linear model is that unique weights are assigned to the predictor variables.
\[ g(E(Y))= \beta_0 + \beta_0 \times (WQS) \] or it can be written as an expanded model where \(w_n\) corresponds the estimated weights for each X variable and \(\theta_1\) is the predicted impact of the WQS method on all of the chemical exposures.
\[ g(E(Y))= \beta_0 + \beta_1(w_1X_1 + w_2X_2 + ... w_nX_n) \]
Each outcome variable is assigned a “score”, meaning they are placed in bins using quantiles qi = {0, 1, 2, 3, etc.} For each exposure, there are cutoff points (or weights) that determine where each variable is placed. The weighted components are then summed and are fitted to the regression model. gWQS visualizes the assigned weights and regression fit as demonstrated in the package below:
The gWQS model requires the explanatory mixtures to be saved to a
variable. In this example, we saved them to chem_names
.
Additionally, the output of the gWQS
package is saved to
the variable results
.
Key arguments in the qWQS package:
q
- specifies the number of quantiles to be used
validation
- the percentage of the data set to be used
for creating the weights/bootstrapping and the validating the model
b1_pos
- specify TRUE
if the data has a
positive correlation and
FALSE
if it has a negative correlation
family
- type of association to be tested (ex: gaussian,
binomial)
data
- the data-frame containing the data to be
tested
b
- number of bootstrap samples
mix_name
- a vector containing the names of the
exposures
seed
- set a seed which allows the results to be
replicated
chem_names <- names(nhanes)[2:10]
results <- gwqs(BMXBMI ~ wqs, mix_name = chem_names, data = nhanes,
q = 10, validation = 0.6, b = 1, b1_pos = TRUE,
b1_constr = FALSE, family = "gaussian", seed=1)
It is important to set a seed in the gWQS model. This ensures that the results are replicated consistently when the model is rendered.
The estimated weights are shown numerically below:
head(results$final_weights)
## mix_name mean_weight
## URXECP URXECP 4.635526e-01
## URXMCOH URXMCOH 4.231238e-01
## URXMIB URXMIB 1.133201e-01
## URXMHH URXMHH 1.766438e-06
## URXMOH URXMOH 1.535145e-06
## URXMHNC URXMHNC 1.131085e-07
Using these weights, the final regression results can be obtained:
summary(results)
##
## Call:
## gwqs(formula = BMXBMI ~ wqs, data = nhanes, mix_name = chem_names,
## b = 1, b1_pos = TRUE, b1_constr = FALSE, q = 10, validation = 0.6,
## family = "gaussian", seed = 1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.57623 -0.16121 -0.01061 0.14323 0.95830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.318215 0.009803 338.499 < 2e-16 ***
## wqs 0.014915 0.002861 5.214 2.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.05406224)
##
## Null deviance: 115.65 on 2113 degrees of freedom
## Residual deviance: 114.18 on 2112 degrees of freedom
## AIC: -164.58
##
## Number of Fisher Scoring iterations: 2
The Regression output:
\[ \widehat{log(\text{BMXBMI})} = 3.31 + 0.015 \times (WQS) \]
Interpretation of the model:
When quantiles of the URX variables are 0, the estimated log BMI is approximately 3.31.
As every quantile of the exposures increases by 1, the percent change in BMI increases, on average, by 1.5113%.
It is important to acknowledge the uncertainty in the WQS output. One
method of accounting for this uncertainty is through confidence
intervals. A confidence interval allows us to understand between which
values are the estimations from our model likely to occur in the
real-world, accounting for model variations. Using the
confint(model name)
function in R, we can determine between
which WQS coefficient by the model.
confint(results)
## 2.5 % 97.5 %
## (Intercept) 3.299001588 3.33742763
## wqs 0.009308286 0.02052193
We can be 95% confident that the true wqs coefficient falls within the interval 0.009 and 0.021.
We may also be interested in testing whether URX variables have significant impact on BMI. Included in the regression output is the results significance tests of the coefficients. To test the significance of the model, we can use a t-distribution and t-statistic. The t-statistic for WQS is 5.214 using the t-distribution. The p-value corresponding to the WQS quantiles is extremely small (2.03e-07). Therefore, there is a significant relationship between the weighted quantiles and the predicted outcome, logBMI.
The gWQS package provides methods for understanding and visualizing the results. One of the most helpful ways to view the output of is through visualizations of the weights assigned to each variable. The weights can demonstrate which exposures from the urine samples are impactful in predicting BMI and which are less helpful.
gwqs_barplot(results)
The first visualization produced is a bar plot showing the weights assigned to each of the URX variables. The variables with greatest weights are URXCEP, URXMOH, and URHMIB while the remaining chemicals are not great predictors of BMI. Also shown in the visualization is a red line which shows the cutoff values - calculated by the inverse of the number of predictor variables.
In addition to the weights assigned to the chemical exposures, it can be helpful to visualize the overall relationship of the WQS model and the predicted BMI. To visualize this relationship, a scatterplot can be utilized to see if there is a linear relationship in the model.
gwqs_scatterplot(results)
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot of the WQS model and BMI demonstrate the linear relationship between the URX chemicals and log BMI.
Most importantly, we can use the WQS regression model to predict our outcome variable (log BMI). Using our model above, we can use the predict() function to determine the corresponding result. The predict function takes our WQS model and the dataframe containing the data for our prediction respectively. Values are from a single data point in the original NHANES dataset.
new_data <- data.frame(
URXMHBP = 12.9,
URXMOH = 6.3,
URXMHP = 2.10,
URXMHH = 26.2,
URXMCOH = 1.40,
URXMHNC = 2.30,
URXECP = 9.7,
URXHIBP = 19.9,
URXMIB = 23.5,
BMXBMI = 3.718438
)
predict(results, newdata = new_data)
## $df_pred
## y ypred
## 1 3.718438 3.385239
##
## $Q
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 7 7 3 9 3 4 5 9 8
##
## $qi
## $qi$URXMHBP
## [1] -Inf 0.4 0.6 0.8 1.0 1.4 1.8 3.0 Inf
##
## $qi$URXMOH
## [1] -Inf 0.8 1.4 2.0 2.7 3.5 4.4 5.6 7.6 11.6 Inf
##
## $qi$URXMHP
## [1] -Inf 1.0 1.3 1.8 2.5 4.1 Inf
##
## $qi$URXMHH
## [1] -Inf 1.3 2.2 3.1 4.2 5.3 6.9 8.7 11.7 18.3 Inf
##
## $qi$URXMCOH
## [1] -Inf 0.5 0.7 1.0 2.0 Inf
##
## $qi$URXMHNC
## [1] -Inf 0.5 0.7 1.0 1.5 3.3 Inf
##
## $qi$URXECP
## [1] -Inf 2.20 3.60 5.00 6.60 8.40 10.40 13.50 17.60 26.38 Inf
##
## $qi$URXHIBP
## [1] -Inf 0.6 1.0 1.4 1.9 2.4 3.1 4.1 5.6 8.9 Inf
##
## $qi$URXMIB
## [1] -Inf 2.00 3.40 4.90 6.50 8.40 10.70 13.64 18.70 28.40 Inf
##
##
## $wqs
## [1] 4.493723
Using our model, the predicted log(BMI) is 3.39. This method is essential for extrapolating results, an important aspect of regression models.
We check the assumption that all the chemical exposures contribute in the same direction using our expertise.
The other assumptions are similar to those of linear regressions. Therefore, we also examine the residuals. Residuals are the differences between the the actual data points in the NHANES dataset and the predicted values from our gWQS model. By examining the residuals on a graph, we can determine if there are any relationships, causing issues with the model.
gwqs_fitted_vs_resid(results)
The final visualization shows the residuals and fitted values. As seen above, the points are distributed around zero and do not have any distinct patterns. Additional interpretations of the output models can be found HERE.
results$fit
##
## Call: glm(formula = formula, family = family, data = dtf, weights = wghts)
##
## Coefficients:
## (Intercept) wqs
## 3.31821 0.01492
##
## Degrees of Freedom: 2113 Total (i.e. Null); 2112 Residual
## Null Deviance: 115.6
## Residual Deviance: 114.2 AIC: -164.6
Null deviance can show how well the model predicts the response, solely using the regression intercept.
Contrary, residual deviance demonstrates how well the model predicts the response, using the explanatory variables. The smaller the residual deviance, the better the accuracy of the regression model.
AIC - This is a statistic to compare two models - the smaller the AIC, the better the model.
For example, we can demonstrate AIC by adding covariates into the model. Covariates in the model are variables that we are not necessarily interested in interpreting, but may be influential to the overall model. By adding covariates (such as age), we can compare how well each gWQS model can predict BMI by examining the AIC metric.
In order to add a covariate using the gWQS package, the varible must be added to the regression equation, shown below. This ensures that age is not treated as a mixture, but a separate variable.
nhanes_new <- nhanes1518 |>
filter(RIDAGEYR >= 18)|>
mutate(BMXBMI = log(BMXBMI))|>
select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB,RIDAGEYR)
chem_names_new <- names(nhanes_new)[2:10]
results_new <- gwqs(BMXBMI ~ wqs + RIDAGEYR, mix_name = chem_names_new, data = nhanes_new,
q = 10, validation = 0.6, b = 1, b1_pos = TRUE,
b1_constr = FALSE, family = "gaussian", seed=1)
summary(results_new)
##
## Call:
## gwqs(formula = BMXBMI ~ wqs + RIDAGEYR, data = nhanes_new, mix_name = chem_names_new,
## b = 1, b1_pos = TRUE, b1_constr = FALSE, q = 10, validation = 0.6,
## family = "gaussian", seed = 1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.57980 -0.16089 -0.01178 0.14365 0.99637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2515795 0.0170586 190.612 < 2e-16 ***
## wqs 0.0160735 0.0030277 5.309 1.22e-07 ***
## RIDAGEYR 0.0013555 0.0002748 4.933 8.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.05354714)
##
## Null deviance: 115.65 on 2113 degrees of freedom
## Residual deviance: 113.04 on 2111 degrees of freedom
## AIC: -183.82
##
## Number of Fisher Scoring iterations: 2
Interpretation of the model:
Notice that we center our age at 18 in previous pre-processing step.
The coefficient for age can be interpreted, similar to a standard linear model:
The interpretation for wqs changes slightly from the example listed above since there is an additional variable:
We can confirm that age is being treated as a covariate by checking the mean weights given to the mixture variables. It is critical to ensure that age is not assigned a weight or included with the URX variables.
head(results_new$final_weights)
## mix_name mean_weight
## URXMCOH URXMCOH 3.734351e-01
## URXECP URXECP 3.624844e-01
## URXMIB URXMIB 1.570369e-01
## URXMHNC URXMHNC 1.070436e-01
## URXMOH URXMOH 6.926465e-08
## URXMHH URXMHH 6.492087e-08
In this example, the AIC from the original model was -164.58 while the new model with age as a covariate has an AIC of -183.82. Therefore, we can say that the second model with age added is a better model, more effectively predicting log BMI.
Additional outputs from the model which may be helpful for diagnostics:
results$bres
- displays the p-values for each of the
bootstrap samples
results$q_i
- the “cutoff” values which are used to
create the quantiles
In conclusion, the gWQS weighted the explanatory variables successfully as our model is considered to be statistically significant. By using the gWQS model with data where explanatory variables are all highly correlated, specific variables can be weighted according to their influence to the response variable. In the NHANES data set, chemicals URXMCOH, URXMIB, URXECP were weighted the highest by the gWQS model.
qgcomp
gWQS
provides a great introduction to the applications
and fundamentals of highly correlated mixtures. It is important to note,
there are additional methods and packages that use weighted quantile
sums. For example, qgcomp
allows us to use a similar
method, but has faster computation and allows more
flexibility in the assumptions and data that can be
used in the model. In the tutorial of qgcomp, we will
introduce quantile g-computation (qgcomp
package), a new
method for modeling chemical mixtures, that can solve similar but more
complex problems.