qgcomp Package

Weighted Quantile Sum Regression (WQS) allows researchers to understand the effects of highly correlated components. In a previous tutorial, we used the gWQS package to implement modeling fitting, inference, and diagnostics. An additional method used to employ Weighted Quantile Sum Regression is the quantile g-computation. Quantile g-computation is firstly developed as a causal effect estimating method, and is implemented by qgcomp package. Quantile g-computation relaxes the directional homogeneity assumption of WQS, here the components do not have to contribute in the same direction. Compared to WQS, g-computation can provide unbiased estimation and appropriate confidence interval coverage. Additionally, compared to gWQS packge, qgcomp package has greater computation efficiency and it allows more flexibility on modeling non-linear relationship.

In this tutorial, we will demonstrate how the qgcomp package can provide additional flexibility when analyzing chemical mixtures. Specifically, we will explore applications for (1) linear model (2) linear model with extra covariates (3) non-linear model (4) missing data, limits of detection, and multiple imputation.

Libraries

For this tutorial, we will be using similar packages to the previous tutorials. Note that in addition to qgcomp, we will also use the splines package for [Non-Linear Models] later in the tutorial.

#| install.packages("gWQS")
#| install.packages("qgcomp")
#| install.packages("corrplot")
#| install.packages("ggcorrplot")
#| install.packages("splines")
#| 
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggcorrplot)
library(corrplot)

Exploratory Data Analysis

For this analysis, we will be using the nhanes dataset, which can be loaded into R studio.

load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
##    SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
##   <int>    <dbl>    <dbl>    <int>    <int>    <int>    <dbl>    <int>    <int>
## 1 83732  134671.  135630.        2        1       62     4.39        3        5
## 2 83733   24329.   25282.        2        1       53     1.32        3        3
## 3 83734   12400.   12576.        2        1       78     1.51        3        3
## 4 83735  102718.  102079.        2        2       56     5           3        5
## 5 83736   17628.   18235.        2        2       42     1.23        4        4
## 6 83737   11252.   10879.        2        2       72     2.82        1        2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## #   BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## #   URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## #   URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## #   URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## #   URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## #   URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …

Consistent with prior tutorials, we hope to study the impacts of specific chemical exposures (Phalates and Phytoestrogen) on log(BMI), for participants over 18. The nhanes data can be normalized through this method.

nhanes <- nhanes1518 |>
    filter(RIDAGEYR >= 18)|>
    mutate(BMXBMIlog = log(BMXBMI),
          RIDAGEYR = RIDAGEYR - 18)|>
    select(BMXBMI, BMXBMIlog, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)

nhanes <- drop_na(nhanes)

nhanes
## # A tibble: 3,523 × 12
##    BMXBMI BMXBMIlog URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP
##     <dbl>     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1   20.3      3.01    5.3     7.8   2       8.7    0.7     1.2    14.1    27.4
##  2   24.1      3.18    0.5     8.6   3.1    11.9    0.35    0.28   16.7     1.5
##  3   43.7      3.78    1.1     8.4   0.57   16.4    0.6     0.28   25.8     6.3
##  4   38        3.64    1.5     7.8   1.2    14.4    0.6     1.1    19.7     4.7
##  5   26.3      3.27    0.5     1.7   0.57    2.7    0.6     0.9     3.9     2  
##  6   25        3.22    1.9    15.4   8.2    27.6    0.35    0.5    26.1     6.1
##  7   22.8      3.13    0.28    6     5.1     9.5   14.2    72.2    14.1     3.7
##  8   34        3.53    1.8     1.9   2.1     2.7    0.35    0.4     4       4.9
##  9   31        3.43    1.5     5     2.6     6.2    0.35    0.28    9.4     1.4
## 10   34        3.53    0.28    2.6   0.57    5.1    0.9     0.8     5.9     3.2
## # ℹ 3,513 more rows
## # ℹ 2 more variables: URXMIB <dbl>, RIDAGEYR <dbl>

qgcomp Model

The qgcomp package shares similarities to gWQS, but there are several key distinctions. Primarily, an essential assumption from gWQS was that all predictor variables must contribute to the same direction. However, qgcomp can fit a model regardless of the contributing direction of the variables.

Similar to gWQS, qgcomp can be applied to a generalized linear regression model:

\[\mathbb{E}(Y | \mathbf{Z,\psi,\eta}) = g(\psi_0 + \psi_1 S_q + \mathbf{\eta Z})\]

The model is called Marginal Structural Model (MSM). In the model, \(g(\cdot)\) is a link function. \(\psi_0\) represents the intercept. \(S_q\) is the combined effect of the exposures (predictor variables) \(S_q =\sum_{j=1}^d w_j{\mathbf{X}_{q,j}}\), where \(\mathbf{X}_q\) is the transformation of of the X variable that has been sorted into quantiles and \(w_j\) represents the weights of quantiles. \(\mathbf{Z}\) represents extra covariates, and \(\eta\) represents corresponding coefficients.

Linear Model

library("qgcomp")
library(gWQS) # for comparison
## Welcome to Weighted Quantile Sum (WQS) Regression.
## If you are using a Mac you have to install XQuartz.
## You can download it from: https://www.xquartz.org/

qgcomp has similar arguments to the gWQS package. qgcomp() wraps functions qgcomp.glm.noboot() and qgcomp.glm.boot(), corresponding to non-bootstrap-based inference and bootstrap-based inference. qgcomp() will select qgcomp.glm.noboot() for linear model, and qgcomp.glm.boot() for nonlinear model. Some core arguements are listed here:

expnms - character vector of exposures of interest (variables we put in weighted sum). By default it will include all the variables in the dat argument except for the response variable.

q - specifies the number of quantiles to be used.

B - number of bootstrap samples when using qgcomp.glm.boot(). If we fit a non-linear model, we will employ bootstrap for simulation based inference. Usually we use the default value, which is 200.

seed - random seed for bootstrap-based inference. Recommend to specify the seed when fitting nonlinear model (enable reproducing your results).

degree - degree of polynomial regression models. For example, degree=2 corresponds to a quadratic regression.

Predicting log(BMI) using the same chemical exposures (Phalates and Phytoestrogen) from the gWQS tutorial, a similar result can be accomplished in a shorter timeframe.

chem_names_new <- c('URXMHBP', 'URXMOH',  'URXMHP',  'URXMHH',  'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
system.time(qc.fit <- qgcomp(BMXBMIlog~.,
                             q = 4, # set quantile levels 
                             dat=nhanes[,c(chem_names_new, 'BMXBMIlog')], 
                             family=gaussian()))
## Including all model terms as exposures of interest
##    user  system elapsed 
##   0.018   0.001   0.018
system.time(results <- gwqs(BMXBMIlog ~ wqs, mix_name = chem_names_new, data = nhanes, 
                            q = 4, # set quantile levels
                            b = 1, # only set 1 bootstrap sample, not recommended for estimation, just use for time comparison
                b1_pos = TRUE, b1_constr = FALSE, family = "gaussian", seed=1))
##    user  system elapsed 
##   0.098   0.007   0.105

The gWQS resulted in a longer elapsed time compared to the qgcomp model.

The output of the linear model is demonstrated below:

qc.fit
## Scaled effect size (positive direction, sum of positive coefficients = 0.106)
##  URXECP URXMCOH  URXMIB  URXMOH  URXMHH 
##  0.3130  0.2796  0.2688  0.0904  0.0482 
## 
## Scaled effect size (negative direction, sum of negative coefficients = -0.0875)
##  URXMHP URXHIBP URXMHBP URXMHNC 
##   0.497   0.266   0.162   0.076 
## 
## Mixture slope parameters (Delta method CI):
## 
##              Estimate Std. Error   Lower CI Upper CI  t value Pr(>|t|)
## (Intercept) 3.3199177  0.0073066  3.3055971 3.334238 454.3749  < 2e-16
## psi1        0.0188245  0.0111702 -0.0030687 0.040718   1.6852  0.09203

Similar to gWQS, a regression model output is provided which follows the same interpretation as above. The coefficients, standard errors, confidence intervals, t-statistics, and p-values are provided.

The Regression output:

\[ \widehat{log(\text{BMXBMI})} = 3.319 + 0.019 \times (WQS) \]

Unlike gWQS, the qgcomp package allows the variables to contribute in different directions, as demonstrated with the “weights”. Instead of unidirectional relationships, a weight is assigned to the individual chemical exposure corresponding to “the proportion of the overall effect when all of the exposures have effects in the same direction” (R tutorial of qgcomp package). The weights are shown in the figure below:

plot(qc.fit)

An interesting output of the qgcomp model is the sum of the positive and negative coefficients. In application, this appears to equate to a “partial effect”, the impact of a chemical with a positive/negative coefficient. However, it cannot be interpreted as a “true partial effect” due to the fact that the sum of the coefficients depends on the model results/fit.

In order to estimate the partial effects without the issues discussed above, a sample splitting procedure can be used by creating training and validation dataset. More information on how to complete this procedure can be found on the CRAN website.

Here provides the model diagnosis.

par(mfrow = c(2, 2))
plot(qc.fit$fit)

Examining the residual plots (figure 1 and figure 2), the datum appear to be fairly evenly spread about the axis. However, there appears to be a cluster of points centered between 3.3 and 3.5. From Q-Q plot, only very few points deviates from the theoretical quantiles. Only three data points marked as high leverage points. Despite this, there are generally no major concerns with the model diagnosis.

Regression with Covariates

A similarity to Weighted Quantile Sum Regression is the ability to adjust for covariates. As demonstrated in the WQS tutorial, we are interested if our model does a better job of predicting BMI while adjusting for the age variable, RIDAGEYR.

covars = c('RIDAGEYR')

qc.fit.covar <- qgcomp(BMXBMIlog~.,
                       expnms = chem_names_new,
                       dat=nhanes[,c(chem_names_new,covars, 'BMXBMIlog')], 
                       q = 4, family=gaussian())

qc.fit.covar
## Scaled effect size (positive direction, sum of positive coefficients = 0.0977)
##  URXECP  URXMIB URXMCOH  URXMOH  URXMHH 
##  0.3095  0.2912  0.2660  0.0835  0.0499 
## 
## Scaled effect size (negative direction, sum of negative coefficients = -0.0745)
##  URXMHP URXHIBP URXMHBP URXMHNC 
## 0.51079 0.27477 0.20702 0.00742 
## 
## Mixture slope parameters (Delta method CI):
## 
##             Estimate Std. Error  Lower CI Upper CI  t value Pr(>|t|)
## (Intercept) 3.286845   0.010479 3.2663069 3.307383 313.6686  < 2e-16
## psi1        0.023213   0.011186 0.0012888 0.045137   2.0752  0.03804

In order to determine if the model with the age covariate is better than our original model, we can once again compare the AIC:

#AIC For First Model
AIC(qc.fit)
## [1] -348.3578
#AIC For Covariate Model
AIC(qc.fit.covar)
## [1] -365.6584

The second model, with age added as a covariate, has a lower AIC. This allows us to conclude that log(BMI) can be better predicted with age included in the model.

Logistic Regression

Another useful application of qgcomp is to find the effect of mixtures on binary variables (variables that can be represented by 1’s “Yes” or 0’s “No”). In this example, we want to predict whether a participant is above the average BMI based on the chemical mixtures used in the previous example. We can create a binary variable corresponding for a participant that is greater than the average BMI (represented with 1) or below the average BMI (represented with 0).

The following code can automate this process.

nhanes <- nhanes |>
      mutate(BMXBMIOVER = ifelse(BMXBMI >= 24.9,1,0))

Now that the BMIOVER variable is now in terms of 1’s and 0’s, we can use qgcomp to predict if a participant is over the average BMI using the chemical mixtures. In the qgcomp function, we need to set family=binomial() for logistic regression.

chem_names_new <- c('URXMHBP', 'URXMOH',  'URXMHP',  'URXMHH',  'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
qc.fit.logistic <- qgcomp(BMXBMIOVER~.,
                          dat=nhanes[,c(chem_names_new, 'BMXBMIOVER')], 
                          q = 4,
                          family=binomial(),
                          seed = 125)
## Including all model terms as exposures of interest
## Expected time to finish: 0.65 minutes
qc.fit.logistic
## Mixture log(RR) (bootstrap CI):
## 
##               Estimate Std. Error Lower CI  Upper CI Z value Pr(>|z|)
## (Intercept) -0.3809412  0.0240369 -0.42805 -0.333830 -15.848   <2e-16
## psi1        -0.0040821  0.0404028 -0.08327  0.075106  -0.101   0.9195

We can interpret both the intercept intercepts and coefficient for the models:

As the quantile sum increases by 1, an individual’s BMI being over vs. under the national average decreases by approximately 0.40% (exp(-0.004)-1).

When the quantile sum is approximately zero, the probability of an individual being over vs. under the national average is approximately 0.683 (exp(-0.381)).

Polynomial Regression Models

However, unlike gWQS, qgcomp can deal with non-linearity, where the relationship between the response and predictors do not follow a linear pattern. In the following example, by specifying . + .^2 the same predictor variables are transformed and squared in order to reflect a non-linear relationship: include extra quadratic terms for all predictors.

\[\mathbb{E}(Y | \mathbf{\psi}) = \psi_0 + \psi_1 S_q',\quad where \quad S_q'=S_q+S_q^2 \] Notice here \(\sum w_j (S_{q,j}+S_{q,j}^2) = 1\).

The model is still linear if we treat \(S_q'\) as new weighted quantile sum. The linear trend is reflected in the MSM fit line. (MSM stands for estimator based on Marginal Structural Model, which is shown above.)

qc.fit.quad1 <- qgcomp(BMXBMIlog~ . + .^2,
                       expnms=chem_names_new,
                       q = 4,
                       dat = nhanes[,c(chem_names_new, 'BMXBMIlog')],
                       family=gaussian(),seed=125)
qc.fit.quad1
## Mixture slope parameters (bootstrap CI):
## 
##              Estimate Std. Error Lower CI Upper CI  t value Pr(>|t|)
## (Intercept)  3.346262   0.028690  3.29003 3.402493 116.6371   <2e-16
## psi1        -0.039581   0.047798 -0.13326 0.054102  -0.8281   0.4077
plot(qc.fit.quad1)

We can introduce degree = 2 to introduce quadratic trend.

\[\mathbb{E}(Y | \mathbf{\psi}) = \psi_0 + \psi_1 S_q + \psi_2 S_q^2\]

Notice here \(\sum w_jS_{q,j} = 1, \quad \sum w_jS_{q,j}^2 = 1\). The MSM fit linear present a quadratic trend. The interpretation is similar to that of generalized linear model with quadratic terms.

qc.fit.quad2 <- qgcomp(BMXBMIlog~ . + .^2,
                         expnms=chem_names_new,
                         dat = nhanes[,c(chem_names_new, 'BMXBMIlog')], 
                         q=4, 
                         degree = 2, # control the quadratic 
                         family=gaussian(), seed=125)
plot(qc.fit.quad2)

qc.fit.quad2
## Mixture slope parameters (bootstrap CI):
## 
##               Estimate Std. Error  Lower CI Upper CI  t value Pr(>|t|)
## (Intercept)  3.3114870  0.0086998  3.294436 3.328538 380.6379   <2e-16
## psi1         0.0647453  0.0468907 -0.027159 0.156649   1.3808   0.1674
## psi2        -0.0347754  0.0278097 -0.089281 0.019731  -1.2505   0.2112

Non-linear regression model using splines

In the previous example, we specify the non-linear regression model based on our prior (quadratic terms). Here qgcomp can also allow learning of nonlinear trend by using splines. Splines models approximate an unknown nonlinear relationship using a “piecewise” polynomial regression method (as shown below):

To compute the models, a new package called splines will be used. For more information about the Splines package, please visit A review of spline function procedures in R.

library(splines)

Often with chemical mixtures, variables that have high correlations may have a non-linear trend. Exposures that are highly correlated with each other may have an underlying similarities, “a common environmental source or set of behaviors” (R tutorial of qgcomp package). First, a correlation matrix can be used to determine which chemicals found in urine samples have a strong correlation.

By viewing the following correlation matrix of the urine samples, we can see that several variables are highly correlated with each other. For this example, we will use the URXMOH and URXMHH variables since they have the highest pairwise correlation.

nhanescor <- nhanes|> select(URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)

cor_hanes <- cor(nhanescor)
round(cor_hanes,digits = 2) # present correlation in 2 digits
##          URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP URXMIB
## URXMHBP     1.00   0.13   0.08   0.11    0.03    0.02   0.10    0.24   0.19
## URXMOH      0.13   1.00   0.72   0.97    0.02    0.02   0.83    0.20   0.19
## URXMHP      0.08   0.72   1.00   0.75    0.02    0.02   0.59    0.14   0.16
## URXMHH      0.11   0.97   0.75   1.00    0.03    0.02   0.80    0.18   0.18
## URXMCOH     0.03   0.02   0.02   0.03    1.00    0.87   0.02    0.07   0.08
## URXMHNC     0.02   0.02   0.02   0.02    0.87    1.00   0.01    0.06   0.07
## URXECP      0.10   0.83   0.59   0.80    0.02    0.01   1.00    0.18   0.17
## URXHIBP     0.24   0.20   0.14   0.18    0.07    0.06   0.18    1.00   0.93
## URXMIB      0.19   0.19   0.16   0.18    0.08    0.07   0.17    0.93   1.00
## RIDAGEYR    0.02   0.03  -0.07   0.01   -0.03   -0.03   0.02   -0.08  -0.08
##          RIDAGEYR
## URXMHBP      0.02
## URXMOH       0.03
## URXMHP      -0.07
## URXMHH       0.01
## URXMCOH     -0.03
## URXMHNC     -0.03
## URXECP       0.02
## URXHIBP     -0.08
## URXMIB      -0.08
## RIDAGEYR     1.00

To find the best fit for the data, we compare the nonlinear model (splines) and linear model. To fit a model with splines, we only need to apply bs() to variables which we want to apply a spline transformation for.

qc.fit.nonlin <- qgcomp(BMXBMIlog ~ URXMHBP + bs(URXMOH) + URXMHP + bs(URXMHH) + URXMCOH + URXMHNC + URXECP +  URXHIBP + URXMIB,
                         expnms=chem_names_new,
                         nhanes[,c(chem_names_new, 'BMXBMIlog')], family=gaussian(), q=8, B=5)

qc.fit.nonlin
## Mixture slope parameters (bootstrap CI):
## 
##                Estimate  Std. Error   Lower CI  Upper CI  t value Pr(>|t|)
## (Intercept)  3.30752346  0.00634210  3.2950932 3.3199538 521.5184   <2e-16
## psi1        -0.00073031  0.00379397 -0.0081664 0.0067057  -0.1925   0.8474

Here we print out the coefficient estimation. We observe the variables which we applied splines presented as 3 terms, such as bs()1, bs()2, bs()3. Each of these term represent a different B-spline basis function which are not directly interpretable.

summary(qc.fit.nonlin$fit)$coefficients
##                 Estimate  Std. Error      t value     Pr(>|t|)
## (Intercept)  3.294195666 0.010374180 317.53792413 0.000000e+00
## URXMHBP     -0.007552883 0.003082389  -2.45033433 1.432086e-02
## bs(URXMOH)1  0.090332981 0.053642307   1.68398764 9.227300e-02
## bs(URXMOH)2  0.045028762 0.052532542   0.85715940 3.914153e-01
## bs(URXMOH)3  0.090930100 0.048191739   1.88683999 5.926433e-02
## URXMHP      -0.028385084 0.003774605  -7.52001349 6.916581e-14
## bs(URXMHH)1 -0.008169848 0.054051792  -0.15114851 8.798673e-01
## bs(URXMHH)2  0.001451855 0.050762731   0.02860080 9.771846e-01
## bs(URXMHH)3  0.004433084 0.045618474   0.09717739 9.225911e-01
## URXMCOH      0.013493984 0.006543123   2.06231560 3.925115e-02
## URXMHNC     -0.007130001 0.004951339  -1.44001491 1.499524e-01
## URXECP       0.013896047 0.004619666   3.00802026 2.648224e-03
## URXHIBP     -0.017128923 0.004084675  -4.19346028 2.815197e-05
## URXMIB       0.020223594 0.003817911   5.29703120 1.249438e-07

We can plot the model fitting, we observe the smooth conditional fit is nonliear which presents the nonlinear relationship captured by our splines.

plot(qc.fit.nonlin)

We can examine the residual plots using plot(model$fit). The residuals appear to generally follow the QQ plot and appear to be evenly scattered in the Residuals vs Fitted visualization.

par(mfrow = c(2, 2))
plot(qc.fit.nonlin$fit)

We compare splines regression with linear regression using AIC. The non-linear model using spines produces the lower AIC compared to the standard linear form. This means that the splines method produced a more appropriate model for the highly correlated urine samples.

qc.fit.lin <- qgcomp(BMXBMIlog ~ URXMHBP + URXMOH + URXMHP + URXMHH + URXMCOH + URXMHNC + URXECP +  URXHIBP + URXMIB,
                     expnms=chem_names_new,
                     q = 4,
                     dat = nhanes[,c(chem_names_new, 'BMXBMIlog')], family=gaussian())

AIC(qc.fit.lin)
## [1] -348.3578
AIC(qc.fit.nonlin)
## [1] -376.4991

Missing Data, Limits of Detection, and Multiple Imputation

It is often that datum may include missing entries, possibly causing concern for conclusions. Similar to linear regression, there are methods to handle these situations.

Primarily, a complete case analysis is often used for creating models. This means that all data entries have complete information, those with NA’s are removed entirely. Unlike other regression packages, qgcomp package requires the researcher to make their own dataset with the function complete.cases.

Additionally, missing data can become problematic when chemical exposures fall below the limit of detection. One method to correct this issue is to use very small values for the missing entries. As a general rule for qgcomp, if the proportion of the number of entries (below the limit of detection) are less than 1/(q), where q is the number of quantiles used, small values may be utilized.

Lastly, missing data can be handled by a method called multiple imputation. This procedure replaces entries without data with “specific values” so that all participants can be included in the study (those with missing data do not have to be excluded entirely). Similar to above, a new dataset can be created using multiple imputation. Suggested packages include mice, with the function mice.impute.leftcenslognorm for best interface results with qgcomp.

Miscellaneous

An additional feature of qgcomp is the use of time-to-event data analysis. This method helps predict whether and when an event happened. Standard linear models are unable to compute this metric, but it can be helpful in medical applications.

The qgcomp package also has the capabilities to compute models despite the data having covariance among observations (the error term is not independent for individual observations). To compensate for this issue, cluster-based sampling and bootstrapping can be used through qgcomp.glm.boot.

To read more about how to complete these steps and for additional applications of the qgcomp model, visit the vignettes.