Assume we are interested in exploring the relationship between the chemical exposures and the outcome (BMI for example). These chemical exposures are usually highly correlated with each other (known as multicollinearity phenomenon in regression). Multicollinearity among the predictors impedes model inference and prediction. The standard errors for our regression coefficients will inflate, and thus we will lose precision in our estimates of the regression coefficients. To eliminate the issue of multicollinearity, we can consider summarizing these highly correlated chemical mixtures when putting them into a regression model. The idea of summary motivates the weighted quantile sum regression.

Unlike ordinary multivariate regression, the Weighted Quantile Sum Regression (WQS) model uses quantiles of chemical exposures and weighs them based on their individual impacts on the overall outcome. The exposure that has a strong impact on the mixture are assigned a greater weight while others that are not as critical are given a lower weight. The weighted sum of quantiles of chemical exposures is then treated as a predictor and regress on the response variable. We utilize the r package gWQS to illustrate model fitting, prediction, and evaluation.

We recommend another package qgcomp, which allows us to use a similar method, but has faster computation and allows more flexibility in the assumptions and data that can be used in the model.

Libraries

In order to investigate relationships between chemical exposures using the gWQS package, there are several packages that must be installed in R. Similar to previous tutorials where packages were introduced, we will be using tidyverse and tidymodels for data wrangling and model building, but we will also use ggcorrplot for data visualization.

In order to use the new packages in this tutorial, they must first be installed in R by running the following lines of code install.packages("ggcorrplot") and install.packages("gWQS")in the console.

Additional installations may be required for mac users!

library(gWQS)
library(tidyverse)
library(tidymodels)
library(ggcorrplot)

In this tutorial, we use the new base R pipe operator “|>” which is equivalent to “%>%”

Exploratory Data Analysis

As discussed earlier, we want to apply Weighted Quantile Sums to a real world issue. This example uses the NHANES dataset to focus on specific chemical exposures: Phalates and Phytoestrogens. Both of these exposures can be detected in urine samples and we want to estimate their respective impact on the outcome, body mass index. In order to predict body mass index with urine samples, we first need to reintroduce the NHANES dataset for use in this example.

The NHANES dataset can be loaded into R:

load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
##    SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
##   <int>    <dbl>    <dbl>    <int>    <int>    <int>    <dbl>    <int>    <int>
## 1 83732  134671.  135630.        2        1       62     4.39        3        5
## 2 83733   24329.   25282.        2        1       53     1.32        3        3
## 3 83734   12400.   12576.        2        1       78     1.51        3        3
## 4 83735  102718.  102079.        2        2       56     5           3        5
## 5 83736   17628.   18235.        2        2       42     1.23        4        4
## 6 83737   11252.   10879.        2        2       72     2.82        1        2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## #   BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## #   URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## #   URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## #   URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## #   URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## #   URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …

Once the data is loaded, we can proceed to understand how our chemical exposures (detected in urine samples) impact the body mass index of participants.

Similar to the previous tutorial where age is filtered, we only want to consider participants in the data that are above 18 years old and take the natural log of BMI. By filtering and using the natural log, the distribution of the ages of participants are normalized. To finish preparing the dataset, the N/A values are removed and the resulting data is saved under the name nhanes.

nhanes <- nhanes1518 |>
    filter(RIDAGEYR >= 18)|>
    mutate(BMXBMI = log(BMXBMI),
          RIDAGEYR = RIDAGEYR - 18)|> 
    select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB)

nhanes <- drop_na(nhanes)

nhanes
## # A tibble: 3,523 × 10
##    BMXBMI URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP URXMIB
##     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
##  1   3.01    5.3     7.8   2       8.7    0.7     1.2    14.1    27.4   73.7
##  2   3.18    0.5     8.6   3.1    11.9    0.35    0.28   16.7     1.5    4.4
##  3   3.78    1.1     8.4   0.57   16.4    0.6     0.28   25.8     6.3   32.7
##  4   3.64    1.5     7.8   1.2    14.4    0.6     1.1    19.7     4.7    8.8
##  5   3.27    0.5     1.7   0.57    2.7    0.6     0.9     3.9     2      7.1
##  6   3.22    1.9    15.4   8.2    27.6    0.35    0.5    26.1     6.1   13.4
##  7   3.13    0.28    6     5.1     9.5   14.2    72.2    14.1     3.7   17  
##  8   3.53    1.8     1.9   2.1     2.7    0.35    0.4     4       4.9   14.7
##  9   3.43    1.5     5     2.6     6.2    0.35    0.28    9.4     1.4    3.7
## 10   3.53    0.28    2.6   0.57    5.1    0.9     0.8     5.9     3.2   12.8
## # ℹ 3,513 more rows

To confirm the normality of the variable BMI, we can create a histogram to view the shape of the observations:

 nhanes|> ggplot(aes(x = BMXBMI))+ 
   geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Key Assumptions

In order to use the gWQS package in an application setting, several conditions must be followed. The first condition is that the chemical exposures must ALL contribute in one direction. In order to contribute in one direction, all variables must be either positively correlated, or negatively correlated with each other. In addition to the correlation of the exposures, the dataset must be split into two sections. The first section is the training data for providing weights to the variables, while the second is used exclusively for verifying the model and ensuring that the results are reliable.

In order to use weighted quantile sum regression with the nhanes data, it is important to first confirm that all constraints are satisfied:

An initial correlation model reveals that the URX predictor variables all contribute in a single direction, validating the appropriateness of the gWQS model.

cor_hanes <- cor(nhanes)

ggcorrplot(cor_hanes)+
  labs(title = "Corrleation of URX Variables and BMI")

#Source http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2#:~:text=The%20easiest%20way%20to%20visualize,ggcorr()%20in%20ggally%20package

However, if we fit a normal linear regression model to these predictors, positive and negative estimates would be produced. Intuitively, the URX variables should all have positive coefficients, however, this model is not representative of this assumption. This is a demonstration of multi-collinearity, where there are there is a large uncertainty (variance) when calculating the coefficients. Therefore, the output of the model may produce unusual signs for the beta estimates.

As shown below, some of the URX betas are positive, while others are negative. Instead, we want to use a gWQS model which will provide a unidirectional output, similar to a standard linear regression model.

#large uncertanty (variance) around the coefficients and multicolinearity (including that the coeffients may be negatve) - point estimate may be negative (abnormal signs)
linear_reg()|>
  fit(BMXBMI~., data = nhanes)|>
tidy()
## # A tibble: 10 × 5
##    term          estimate std.error statistic   p.value
##    <chr>            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)  3.36       0.00509   660.     0        
##  2 URXMHBP     -0.0000177  0.000839   -0.0211 0.983    
##  3 URXMOH       0.00209    0.00174     1.21   0.228    
##  4 URXMHP      -0.00704    0.00164    -4.29   0.0000182
##  5 URXMHH       0.000405   0.00101     0.401  0.688    
##  6 URXMCOH      0.00238    0.00105     2.26   0.0236   
##  7 URXMHNC     -0.000869   0.000402   -2.16   0.0306   
##  8 URXECP      -0.0000514  0.000285   -0.180  0.857    
##  9 URXHIBP     -0.00567    0.00155    -3.67   0.000250 
## 10 URXMIB       0.00185    0.000464    3.98   0.0000693

WQS Model

Once the conditions are satisfied, a model can be produced, similarly to standard linear regression model. One difference from the linear model is that unique weights are assigned to the predictor variables.

\[ g(E(Y))= \beta_0 + \beta_0 \times (WQS) \] or it can be written as an expanded model where \(w_n\) corresponds the estimated weights for each X variable and \(\theta_1\) is the predicted impact of the WQS method on all of the chemical exposures.

\[ g(E(Y))= \beta_0 + \beta_1(w_1X_1 + w_2X_2 + ... w_nX_n) \]

Each outcome variable is assigned a “score”, meaning they are placed in bins using quantiles qi = {0, 1, 2, 3, etc.} For each exposure, there are cutoff points (or weights) that determine where each variable is placed. The weighted components are then summed and are fitted to the regression model. gWQS visualizes the assigned weights and regression fit as demonstrated in the package below:

Model Fitting

The gWQS model requires the explanatory mixtures to be saved to a variable. In this example, we saved them to chem_names.

Additionally, the output of the gWQS package is saved to the variable results.

Key arguments in the qWQS package:

q - specifies the number of quantiles to be used

validation - the percentage of the data set to be used for creating the weights/bootstrapping and the validating the model

b1_pos - specify TRUE if the data has a positive correlation and

FALSE if it has a negative correlation

family - type of association to be tested (ex: gaussian, binomial)

data - the data-frame containing the data to be tested

b - number of bootstrap samples

mix_name - a vector containing the names of the exposures

seed - set a seed which allows the results to be replicated

chem_names <- names(nhanes)[2:10]

results <- gwqs(BMXBMI ~ wqs, mix_name = chem_names, data = nhanes, 
                q = 10, validation = 0.6, b = 1, b1_pos = TRUE, 
                b1_constr = FALSE, family = "gaussian", seed=1)

It is important to set a seed in the gWQS model. This ensures that the results are replicated consistently when the model is rendered.

Model Interpretation and Inference

The estimated weights are shown numerically below:

head(results$final_weights)
##         mix_name  mean_weight
## URXECP    URXECP 4.635526e-01
## URXMCOH  URXMCOH 4.231238e-01
## URXMIB    URXMIB 1.133201e-01
## URXMHH    URXMHH 1.766438e-06
## URXMOH    URXMOH 1.535145e-06
## URXMHNC  URXMHNC 1.131085e-07

Using these weights, the final regression results can be obtained:

summary(results)
## 
## Call:
## gwqs(formula = BMXBMI ~ wqs, data = nhanes, mix_name = chem_names, 
##     b = 1, b1_pos = TRUE, b1_constr = FALSE, q = 10, validation = 0.6, 
##     family = "gaussian", seed = 1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.57623  -0.16121  -0.01061   0.14323   0.95830  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.318215   0.009803 338.499  < 2e-16 ***
## wqs         0.014915   0.002861   5.214 2.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.05406224)
## 
##     Null deviance: 115.65  on 2113  degrees of freedom
## Residual deviance: 114.18  on 2112  degrees of freedom
## AIC: -164.58
## 
## Number of Fisher Scoring iterations: 2

The Regression output:

\[ \widehat{log(\text{BMXBMI})} = 3.31 + 0.015 \times (WQS) \]

Interpretation of the model:

  • When quantiles of the URX variables are 0, the estimated log BMI is approximately 3.31.

  • As every quantile of the exposures increases by 1, the percent change in BMI increases, on average, by 1.5113%.

It is important to acknowledge the uncertainty in the WQS output. One method of accounting for this uncertainty is through confidence intervals. A confidence interval allows us to understand between which values are the estimations from our model likely to occur in the real-world, accounting for model variations. Using the confint(model name) function in R, we can determine between which WQS coefficient by the model.

confint(results)
##                   2.5 %     97.5 %
## (Intercept) 3.299001588 3.33742763
## wqs         0.009308286 0.02052193

We can be 95% confident that the true wqs coefficient falls within the interval 0.009 and 0.021.

We may also be interested in testing whether URX variables have significant impact on BMI. Included in the regression output is the results significance tests of the coefficients. To test the significance of the model, we can use a t-distribution and t-statistic. The t-statistic for WQS is 5.214 using the t-distribution. The p-value corresponding to the WQS quantiles is extremely small (2.03e-07). Therefore, there is a significant relationship between the weighted quantiles and the predicted outcome, logBMI.

The gWQS package provides methods for understanding and visualizing the results. One of the most helpful ways to view the output of is through visualizations of the weights assigned to each variable. The weights can demonstrate which exposures from the urine samples are impactful in predicting BMI and which are less helpful.

gwqs_barplot(results)

The first visualization produced is a bar plot showing the weights assigned to each of the URX variables. The variables with greatest weights are URXCEP, URXMOH, and URHMIB while the remaining chemicals are not great predictors of BMI. Also shown in the visualization is a red line which shows the cutoff values - calculated by the inverse of the number of predictor variables.

In addition to the weights assigned to the chemical exposures, it can be helpful to visualize the overall relationship of the WQS model and the predicted BMI. To visualize this relationship, a scatterplot can be utilized to see if there is a linear relationship in the model.

gwqs_scatterplot(results)
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot of the WQS model and BMI demonstrate the linear relationship between the URX chemicals and log BMI.

Model Prediction

Most importantly, we can use the WQS regression model to predict our outcome variable (log BMI). Using our model above, we can use the predict() function to determine the corresponding result. The predict function takes our WQS model and the dataframe containing the data for our prediction respectively. Values are from a single data point in the original NHANES dataset.

new_data <- data.frame(
URXMHBP = 12.9,
URXMOH = 6.3,
URXMHP = 2.10,
URXMHH = 26.2,
URXMCOH = 1.40,
URXMHNC = 2.30,
URXECP = 9.7,
URXHIBP = 19.9,
URXMIB = 23.5,
BMXBMI = 3.718438
)
predict(results, newdata = new_data)
## $df_pred
##          y    ypred
## 1 3.718438 3.385239
## 
## $Q
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]    7    7    3    9    3    4    5    9    8
## 
## $qi
## $qi$URXMHBP
## [1] -Inf  0.4  0.6  0.8  1.0  1.4  1.8  3.0  Inf
## 
## $qi$URXMOH
##  [1] -Inf  0.8  1.4  2.0  2.7  3.5  4.4  5.6  7.6 11.6  Inf
## 
## $qi$URXMHP
## [1] -Inf  1.0  1.3  1.8  2.5  4.1  Inf
## 
## $qi$URXMHH
##  [1] -Inf  1.3  2.2  3.1  4.2  5.3  6.9  8.7 11.7 18.3  Inf
## 
## $qi$URXMCOH
## [1] -Inf  0.5  0.7  1.0  2.0  Inf
## 
## $qi$URXMHNC
## [1] -Inf  0.5  0.7  1.0  1.5  3.3  Inf
## 
## $qi$URXECP
##  [1]  -Inf  2.20  3.60  5.00  6.60  8.40 10.40 13.50 17.60 26.38   Inf
## 
## $qi$URXHIBP
##  [1] -Inf  0.6  1.0  1.4  1.9  2.4  3.1  4.1  5.6  8.9  Inf
## 
## $qi$URXMIB
##  [1]  -Inf  2.00  3.40  4.90  6.50  8.40 10.70 13.64 18.70 28.40   Inf
## 
## 
## $wqs
## [1] 4.493723

Using our model, the predicted log(BMI) is 3.39. This method is essential for extrapolating results, an important aspect of regression models.

Model Diagnostics

We check the assumption that all the chemical exposures contribute in the same direction using our expertise.

The other assumptions are similar to those of linear regressions. Therefore, we also examine the residuals. Residuals are the differences between the the actual data points in the NHANES dataset and the predicted values from our gWQS model. By examining the residuals on a graph, we can determine if there are any relationships, causing issues with the model.

gwqs_fitted_vs_resid(results)

The final visualization shows the residuals and fitted values. As seen above, the points are distributed around zero and do not have any distinct patterns. Additional interpretations of the output models can be found HERE.

Model Comparison

results$fit
## 
## Call:  glm(formula = formula, family = family, data = dtf, weights = wghts)
## 
## Coefficients:
## (Intercept)          wqs  
##     3.31821      0.01492  
## 
## Degrees of Freedom: 2113 Total (i.e. Null);  2112 Residual
## Null Deviance:       115.6 
## Residual Deviance: 114.2     AIC: -164.6
  • Null deviance can show how well the model predicts the response, solely using the regression intercept.

  • Contrary, residual deviance demonstrates how well the model predicts the response, using the explanatory variables. The smaller the residual deviance, the better the accuracy of the regression model.

  • AIC - This is a statistic to compare two models - the smaller the AIC, the better the model.

For example, we can demonstrate AIC by adding covariates into the model. Covariates in the model are variables that we are not necessarily interested in interpreting, but may be influential to the overall model. By adding covariates (such as age), we can compare how well each gWQS model can predict BMI by examining the AIC metric.

In order to add a covariate using the gWQS package, the varible must be added to the regression equation, shown below. This ensures that age is not treated as a mixture, but a separate variable.

nhanes_new <- nhanes1518 |>
    filter(RIDAGEYR >= 18)|>
    mutate(BMXBMI = log(BMXBMI))|>
    select(BMXBMI, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB,RIDAGEYR)


chem_names_new <- names(nhanes_new)[2:10]

results_new <- gwqs(BMXBMI ~ wqs + RIDAGEYR, mix_name = chem_names_new, data = nhanes_new, 
                q = 10, validation = 0.6, b = 1, b1_pos = TRUE, 
                b1_constr = FALSE, family = "gaussian", seed=1)
summary(results_new)
## 
## Call:
## gwqs(formula = BMXBMI ~ wqs + RIDAGEYR, data = nhanes_new, mix_name = chem_names_new, 
##     b = 1, b1_pos = TRUE, b1_constr = FALSE, q = 10, validation = 0.6, 
##     family = "gaussian", seed = 1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.57980  -0.16089  -0.01178   0.14365   0.99637  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.2515795  0.0170586 190.612  < 2e-16 ***
## wqs         0.0160735  0.0030277   5.309 1.22e-07 ***
## RIDAGEYR    0.0013555  0.0002748   4.933 8.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.05354714)
## 
##     Null deviance: 115.65  on 2113  degrees of freedom
## Residual deviance: 113.04  on 2111  degrees of freedom
## AIC: -183.82
## 
## Number of Fisher Scoring iterations: 2

Interpretation of the model:

  • When quantiles of the URX variables are 0 and age of the individual is 18, the estimated log BMI is approximately 3.25.

Notice that we center our age at 18 in previous pre-processing step.

The coefficient for age can be interpreted, similar to a standard linear model:

  • For every one unit increase in the age of the person, the log BMI is predicted to increase by, on average, 0.0014, when quantiles of the URX variables are held constant.

The interpretation for wqs changes slightly from the example listed above since there is an additional variable:

  • As every quantile of the exposures increases by 1, log BMI increases, on average, by 0.015, holding age constant.

We can confirm that age is being treated as a covariate by checking the mean weights given to the mixture variables. It is critical to ensure that age is not assigned a weight or included with the URX variables.

head(results_new$final_weights)
##         mix_name  mean_weight
## URXMCOH  URXMCOH 3.734351e-01
## URXECP    URXECP 3.624844e-01
## URXMIB    URXMIB 1.570369e-01
## URXMHNC  URXMHNC 1.070436e-01
## URXMOH    URXMOH 6.926465e-08
## URXMHH    URXMHH 6.492087e-08

In this example, the AIC from the original model was -164.58 while the new model with age as a covariate has an AIC of -183.82. Therefore, we can say that the second model with age added is a better model, more effectively predicting log BMI.

Miscellaneous

Additional outputs from the model which may be helpful for diagnostics:

results$bres - displays the p-values for each of the bootstrap samples

results$q_i - the “cutoff” values which are used to create the quantiles

Conclusion

In conclusion, the gWQS weighted the explanatory variables successfully as our model is considered to be statistically significant. By using the gWQS model with data where explanatory variables are all highly correlated, specific variables can be weighted according to their influence to the response variable. In the NHANES data set, chemicals URXMCOH, URXMIB, URXECP were weighted the highest by the gWQS model.

Alternative package: qgcomp

gWQS provides a great introduction to the applications and fundamentals of highly correlated mixtures. It is important to note, there are additional methods and packages that use weighted quantile sums. For example, qgcomp allows us to use a similar method, but has faster computation and allows more flexibility in the assumptions and data that can be used in the model. In the tutorial of qgcomp, we will introduce quantile g-computation (qgcomp package), a new method for modeling chemical mixtures, that can solve similar but more complex problems.