Weighted Quantile Sum Regression (WQS) allows researchers to
understand the effects of highly correlated components. In a previous
tutorial, we used the gWQS
package to implement modeling
fitting, inference, and diagnostics. An additional method used to employ
Weighted Quantile Sum Regression is the quantile g-computation. Quantile
g-computation is firstly developed as a causal effect estimating method,
and is implemented by qgcomp
package. Quantile
g-computation relaxes the directional homogeneity
assumption of WQS, here the components do not have to
contribute in the same direction. Compared to WQS, g-computation can
provide unbiased estimation and appropriate confidence interval
coverage. Additionally, compared to gWQS
packge,
qgcomp
package has greater computation efficiency and it
allows more flexibility on modeling non-linear relationship.
In this tutorial, we will demonstrate how the qgcomp
package can provide additional flexibility when analyzing chemical
mixtures. Specifically, we will explore applications for (1) linear
model (2) linear model with extra covariates (3) non-linear model (4)
missing data, limits of detection, and multiple imputation.
For this tutorial, we will be using similar packages to the previous
tutorials. Note that in addition to qgcomp
, we will also
use the splines
package for [Non-Linear Models] later in
the tutorial.
#| install.packages("gWQS")
#| install.packages("qgcomp")
#| install.packages("corrplot")
#| install.packages("ggcorrplot")
#| install.packages("splines")
#|
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggcorrplot)
library(corrplot)
For this analysis, we will be using the nhanes
dataset,
which can be loaded into R studio.
load(file='nhanes1518.rda')
head(nhanes1518)
## # A tibble: 6 × 52
## SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int>
## 1 83732 134671. 135630. 2 1 62 4.39 3 5
## 2 83733 24329. 25282. 2 1 53 1.32 3 3
## 3 83734 12400. 12576. 2 1 78 1.51 3 3
## 4 83735 102718. 102079. 2 2 56 5 3 5
## 5 83736 17628. 18235. 2 2 42 1.23 4 4
## 6 83737 11252. 10879. 2 2 72 2.82 1 2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## # BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## # URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## # URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## # URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## # URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## # URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
Consistent with prior tutorials, we hope to study the impacts of
specific chemical exposures (Phalates and Phytoestrogen) on log(BMI),
for participants over 18. The nhanes
data can be normalized
through this method.
nhanes <- nhanes1518 |>
filter(RIDAGEYR >= 18)|>
mutate(BMXBMIlog = log(BMXBMI),
RIDAGEYR = RIDAGEYR - 18)|>
select(BMXBMI, BMXBMIlog, URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
nhanes <- drop_na(nhanes)
nhanes
## # A tibble: 3,523 × 12
## BMXBMI BMXBMIlog URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.3 3.01 5.3 7.8 2 8.7 0.7 1.2 14.1 27.4
## 2 24.1 3.18 0.5 8.6 3.1 11.9 0.35 0.28 16.7 1.5
## 3 43.7 3.78 1.1 8.4 0.57 16.4 0.6 0.28 25.8 6.3
## 4 38 3.64 1.5 7.8 1.2 14.4 0.6 1.1 19.7 4.7
## 5 26.3 3.27 0.5 1.7 0.57 2.7 0.6 0.9 3.9 2
## 6 25 3.22 1.9 15.4 8.2 27.6 0.35 0.5 26.1 6.1
## 7 22.8 3.13 0.28 6 5.1 9.5 14.2 72.2 14.1 3.7
## 8 34 3.53 1.8 1.9 2.1 2.7 0.35 0.4 4 4.9
## 9 31 3.43 1.5 5 2.6 6.2 0.35 0.28 9.4 1.4
## 10 34 3.53 0.28 2.6 0.57 5.1 0.9 0.8 5.9 3.2
## # ℹ 3,513 more rows
## # ℹ 2 more variables: URXMIB <dbl>, RIDAGEYR <dbl>
The qgcomp
package shares similarities to
gWQS
, but there are several key distinctions. Primarily, an
essential assumption from gWQS
was that all predictor
variables must contribute to the same direction. However,
qgcomp
can fit a model regardless of the contributing
direction of the variables.
Similar to gWQS
, qgcomp
can be applied to a
generalized linear regression model:
\[\mathbb{E}(Y | \mathbf{Z,\psi,\eta}) = g(\psi_0 + \psi_1 S_q + \mathbf{\eta Z})\]
The model is called Marginal Structural Model (MSM). In the model, \(g(\cdot)\) is a link function. \(\psi_0\) represents the intercept. \(S_q\) is the combined effect of the exposures (predictor variables) \(S_q =\sum_{j=1}^d w_j{\mathbf{X}_{q,j}}\), where \(\mathbf{X}_q\) is the transformation of of the X variable that has been sorted into quantiles and \(w_j\) represents the weights of quantiles. \(\mathbf{Z}\) represents extra covariates, and \(\eta\) represents corresponding coefficients.
library("qgcomp")
library(gWQS) # for comparison
## Welcome to Weighted Quantile Sum (WQS) Regression.
## If you are using a Mac you have to install XQuartz.
## You can download it from: https://www.xquartz.org/
qgcomp
has similar arguments to the gWQS
package. qgcomp()
wraps functions
qgcomp.glm.noboot()
and qgcomp.glm.boot()
,
corresponding to non-bootstrap-based inference and bootstrap-based
inference. qgcomp()
will select
qgcomp.glm.noboot()
for linear model, and
qgcomp.glm.boot()
for nonlinear model. Some core arguements
are listed here:
expnms
- character vector of exposures of interest
(variables we put in weighted sum). By default it will include all the
variables in the dat
argument except for the response
variable.
q
- specifies the number of quantiles to be used.
B
- number of bootstrap samples when using
qgcomp.glm.boot()
. If we fit a non-linear model, we will
employ bootstrap for simulation based inference. Usually we use the
default value, which is 200.
seed
- random seed for bootstrap-based inference.
Recommend to specify the seed when fitting nonlinear model (enable
reproducing your results).
degree
- degree of polynomial regression models. For
example, degree=2
corresponds to a quadratic
regression.
Predicting log(BMI) using the same chemical exposures (Phalates and
Phytoestrogen) from the gWQS
tutorial, a similar result can
be accomplished in a shorter timeframe.
chem_names_new <- c('URXMHBP', 'URXMOH', 'URXMHP', 'URXMHH', 'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
system.time(qc.fit <- qgcomp(BMXBMIlog~.,
q = 4, # set quantile levels
dat=nhanes[,c(chem_names_new, 'BMXBMIlog')],
family=gaussian()))
## Including all model terms as exposures of interest
## user system elapsed
## 0.018 0.001 0.018
system.time(results <- gwqs(BMXBMIlog ~ wqs, mix_name = chem_names_new, data = nhanes,
q = 4, # set quantile levels
b = 1, # only set 1 bootstrap sample, not recommended for estimation, just use for time comparison
b1_pos = TRUE, b1_constr = FALSE, family = "gaussian", seed=1))
## user system elapsed
## 0.098 0.007 0.105
The gWQS
resulted in a longer elapsed time compared to
the qgcomp
model.
The output of the linear model is demonstrated below:
qc.fit
## Scaled effect size (positive direction, sum of positive coefficients = 0.106)
## URXECP URXMCOH URXMIB URXMOH URXMHH
## 0.3130 0.2796 0.2688 0.0904 0.0482
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.0875)
## URXMHP URXHIBP URXMHBP URXMHNC
## 0.497 0.266 0.162 0.076
##
## Mixture slope parameters (Delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) 3.3199177 0.0073066 3.3055971 3.334238 454.3749 < 2e-16
## psi1 0.0188245 0.0111702 -0.0030687 0.040718 1.6852 0.09203
Similar to gWQS
, a regression model output is provided
which follows the same interpretation as above. The coefficients,
standard errors, confidence intervals, t-statistics, and p-values are
provided.
The Regression output:
\[ \widehat{log(\text{BMXBMI})} = 3.319 + 0.019 \times (WQS) \]
Unlike gWQS
, the qgcomp
package allows the
variables to contribute in different directions, as demonstrated with
the “weights”. Instead of unidirectional relationships, a weight is
assigned to the individual chemical exposure corresponding to “the
proportion of the overall effect when all of the exposures have effects
in the same direction” (R
tutorial of qgcomp package). The weights are shown in the figure
below:
plot(qc.fit)
An interesting output of the qgcomp
model is the sum of
the positive and negative coefficients. In application, this appears to
equate to a “partial effect”, the impact of a chemical with a
positive/negative coefficient. However, it cannot be interpreted as a
“true partial effect” due to the fact that the sum of the coefficients
depends on the model results/fit.
In order to estimate the partial effects without the issues discussed above, a sample splitting procedure can be used by creating training and validation dataset. More information on how to complete this procedure can be found on the CRAN website.
Here provides the model diagnosis.
par(mfrow = c(2, 2))
plot(qc.fit$fit)
Examining the residual plots (figure 1 and figure 2), the datum appear to be fairly evenly spread about the axis. However, there appears to be a cluster of points centered between 3.3 and 3.5. From Q-Q plot, only very few points deviates from the theoretical quantiles. Only three data points marked as high leverage points. Despite this, there are generally no major concerns with the model diagnosis.
A similarity to Weighted Quantile Sum Regression is the ability to adjust for covariates. As demonstrated in the WQS tutorial, we are interested if our model does a better job of predicting BMI while adjusting for the age variable, RIDAGEYR.
covars = c('RIDAGEYR')
qc.fit.covar <- qgcomp(BMXBMIlog~.,
expnms = chem_names_new,
dat=nhanes[,c(chem_names_new,covars, 'BMXBMIlog')],
q = 4, family=gaussian())
qc.fit.covar
## Scaled effect size (positive direction, sum of positive coefficients = 0.0977)
## URXECP URXMIB URXMCOH URXMOH URXMHH
## 0.3095 0.2912 0.2660 0.0835 0.0499
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.0745)
## URXMHP URXHIBP URXMHBP URXMHNC
## 0.51079 0.27477 0.20702 0.00742
##
## Mixture slope parameters (Delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) 3.286845 0.010479 3.2663069 3.307383 313.6686 < 2e-16
## psi1 0.023213 0.011186 0.0012888 0.045137 2.0752 0.03804
In order to determine if the model with the age covariate is better than our original model, we can once again compare the AIC:
#AIC For First Model
AIC(qc.fit)
## [1] -348.3578
#AIC For Covariate Model
AIC(qc.fit.covar)
## [1] -365.6584
The second model, with age added as a covariate, has a lower AIC. This allows us to conclude that log(BMI) can be better predicted with age included in the model.
Another useful application of qgcomp
is to find the
effect of mixtures on binary variables (variables that can be
represented by 1’s “Yes” or 0’s “No”). In this example, we want to
predict whether a participant is above the average BMI based on the
chemical mixtures used in the previous example. We can create a binary
variable corresponding for a participant that is greater than the
average BMI (represented with 1) or below the average BMI (represented
with 0).
The following code can automate this process.
nhanes <- nhanes |>
mutate(BMXBMIOVER = ifelse(BMXBMI >= 24.9,1,0))
Now that the BMIOVER
variable is now in terms of 1’s and
0’s, we can use qgcomp
to predict if a participant is over
the average BMI using the chemical mixtures. In the qgcomp
function, we need to set family=binomial()
for logistic
regression.
chem_names_new <- c('URXMHBP', 'URXMOH', 'URXMHP', 'URXMHH', 'URXMCOH', 'URXMHNC', 'URXECP', 'URXHIBP', 'URXMIB')
qc.fit.logistic <- qgcomp(BMXBMIOVER~.,
dat=nhanes[,c(chem_names_new, 'BMXBMIOVER')],
q = 4,
family=binomial(),
seed = 125)
## Including all model terms as exposures of interest
## Expected time to finish: 0.65 minutes
qc.fit.logistic
## Mixture log(RR) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## (Intercept) -0.3809412 0.0240369 -0.42805 -0.333830 -15.848 <2e-16
## psi1 -0.0040821 0.0404028 -0.08327 0.075106 -0.101 0.9195
We can interpret both the intercept intercepts and coefficient for the models:
As the quantile sum increases by 1, an individual’s BMI being over vs. under the national average decreases by approximately 0.40% (exp(-0.004)-1).
When the quantile sum is approximately zero, the probability of an individual being over vs. under the national average is approximately 0.683 (exp(-0.381)).
However, unlike gWQS
, qgcomp
can deal with
non-linearity, where the relationship between the response and
predictors do not follow a linear pattern. In the following example, by
specifying . + .^2
the same predictor variables are
transformed and squared in order to reflect a non-linear relationship:
include extra quadratic terms for all predictors.
\[\mathbb{E}(Y | \mathbf{\psi}) = \psi_0 + \psi_1 S_q',\quad where \quad S_q'=S_q+S_q^2 \] Notice here \(\sum w_j (S_{q,j}+S_{q,j}^2) = 1\).
The model is still linear if we treat \(S_q'\) as new weighted quantile sum. The linear trend is reflected in the MSM fit line. (MSM stands for estimator based on Marginal Structural Model, which is shown above.)
qc.fit.quad1 <- qgcomp(BMXBMIlog~ . + .^2,
expnms=chem_names_new,
q = 4,
dat = nhanes[,c(chem_names_new, 'BMXBMIlog')],
family=gaussian(),seed=125)
qc.fit.quad1
## Mixture slope parameters (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) 3.346262 0.028690 3.29003 3.402493 116.6371 <2e-16
## psi1 -0.039581 0.047798 -0.13326 0.054102 -0.8281 0.4077
plot(qc.fit.quad1)
We can introduce degree = 2
to introduce quadratic
trend.
\[\mathbb{E}(Y | \mathbf{\psi}) = \psi_0 + \psi_1 S_q + \psi_2 S_q^2\]
Notice here \(\sum w_jS_{q,j} = 1, \quad \sum w_jS_{q,j}^2 = 1\). The MSM fit linear present a quadratic trend. The interpretation is similar to that of generalized linear model with quadratic terms.
qc.fit.quad2 <- qgcomp(BMXBMIlog~ . + .^2,
expnms=chem_names_new,
dat = nhanes[,c(chem_names_new, 'BMXBMIlog')],
q=4,
degree = 2, # control the quadratic
family=gaussian(), seed=125)
plot(qc.fit.quad2)
qc.fit.quad2
## Mixture slope parameters (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) 3.3114870 0.0086998 3.294436 3.328538 380.6379 <2e-16
## psi1 0.0647453 0.0468907 -0.027159 0.156649 1.3808 0.1674
## psi2 -0.0347754 0.0278097 -0.089281 0.019731 -1.2505 0.2112
In the previous example, we specify the non-linear regression model
based on our prior (quadratic terms). Here qgcomp
can also
allow learning of nonlinear trend by using splines. Splines models
approximate an unknown nonlinear relationship using a “piecewise”
polynomial regression method (as shown below):
To compute the models, a new package called splines
will
be used. For more information about the Splines package, please visit A
review of spline function procedures in R.
library(splines)
Often with chemical mixtures, variables that have high correlations may have a non-linear trend. Exposures that are highly correlated with each other may have an underlying similarities, “a common environmental source or set of behaviors” (R tutorial of qgcomp package). First, a correlation matrix can be used to determine which chemicals found in urine samples have a strong correlation.
By viewing the following correlation matrix of the urine samples, we can see that several variables are highly correlated with each other. For this example, we will use the URXMOH and URXMHH variables since they have the highest pairwise correlation.
nhanescor <- nhanes|> select(URXMHBP, URXMOH, URXMHP, URXMHH, URXMCOH, URXMHNC, URXMHH, URXECP, URXHIBP, URXMIB, RIDAGEYR)
cor_hanes <- cor(nhanescor)
round(cor_hanes,digits = 2) # present correlation in 2 digits
## URXMHBP URXMOH URXMHP URXMHH URXMCOH URXMHNC URXECP URXHIBP URXMIB
## URXMHBP 1.00 0.13 0.08 0.11 0.03 0.02 0.10 0.24 0.19
## URXMOH 0.13 1.00 0.72 0.97 0.02 0.02 0.83 0.20 0.19
## URXMHP 0.08 0.72 1.00 0.75 0.02 0.02 0.59 0.14 0.16
## URXMHH 0.11 0.97 0.75 1.00 0.03 0.02 0.80 0.18 0.18
## URXMCOH 0.03 0.02 0.02 0.03 1.00 0.87 0.02 0.07 0.08
## URXMHNC 0.02 0.02 0.02 0.02 0.87 1.00 0.01 0.06 0.07
## URXECP 0.10 0.83 0.59 0.80 0.02 0.01 1.00 0.18 0.17
## URXHIBP 0.24 0.20 0.14 0.18 0.07 0.06 0.18 1.00 0.93
## URXMIB 0.19 0.19 0.16 0.18 0.08 0.07 0.17 0.93 1.00
## RIDAGEYR 0.02 0.03 -0.07 0.01 -0.03 -0.03 0.02 -0.08 -0.08
## RIDAGEYR
## URXMHBP 0.02
## URXMOH 0.03
## URXMHP -0.07
## URXMHH 0.01
## URXMCOH -0.03
## URXMHNC -0.03
## URXECP 0.02
## URXHIBP -0.08
## URXMIB -0.08
## RIDAGEYR 1.00
To find the best fit for the data, we compare the nonlinear model
(splines) and linear model. To fit a model with splines, we only need to
apply bs()
to variables which we want to apply a spline
transformation for.
qc.fit.nonlin <- qgcomp(BMXBMIlog ~ URXMHBP + bs(URXMOH) + URXMHP + bs(URXMHH) + URXMCOH + URXMHNC + URXECP + URXHIBP + URXMIB,
expnms=chem_names_new,
nhanes[,c(chem_names_new, 'BMXBMIlog')], family=gaussian(), q=8, B=5)
qc.fit.nonlin
## Mixture slope parameters (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) 3.30752346 0.00634210 3.2950932 3.3199538 521.5184 <2e-16
## psi1 -0.00073031 0.00379397 -0.0081664 0.0067057 -0.1925 0.8474
Here we print out the coefficient estimation. We observe the
variables which we applied splines presented as 3 terms, such as
bs()1
, bs()2
, bs()3
. Each of
these term represent a different B-spline basis function which are not
directly interpretable.
summary(qc.fit.nonlin$fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.294195666 0.010374180 317.53792413 0.000000e+00
## URXMHBP -0.007552883 0.003082389 -2.45033433 1.432086e-02
## bs(URXMOH)1 0.090332981 0.053642307 1.68398764 9.227300e-02
## bs(URXMOH)2 0.045028762 0.052532542 0.85715940 3.914153e-01
## bs(URXMOH)3 0.090930100 0.048191739 1.88683999 5.926433e-02
## URXMHP -0.028385084 0.003774605 -7.52001349 6.916581e-14
## bs(URXMHH)1 -0.008169848 0.054051792 -0.15114851 8.798673e-01
## bs(URXMHH)2 0.001451855 0.050762731 0.02860080 9.771846e-01
## bs(URXMHH)3 0.004433084 0.045618474 0.09717739 9.225911e-01
## URXMCOH 0.013493984 0.006543123 2.06231560 3.925115e-02
## URXMHNC -0.007130001 0.004951339 -1.44001491 1.499524e-01
## URXECP 0.013896047 0.004619666 3.00802026 2.648224e-03
## URXHIBP -0.017128923 0.004084675 -4.19346028 2.815197e-05
## URXMIB 0.020223594 0.003817911 5.29703120 1.249438e-07
We can plot the model fitting, we observe the smooth conditional fit is nonliear which presents the nonlinear relationship captured by our splines.
plot(qc.fit.nonlin)
We can examine the residual plots using plot(model$fit)
.
The residuals appear to generally follow the QQ plot and appear to be
evenly scattered in the Residuals vs Fitted visualization.
par(mfrow = c(2, 2))
plot(qc.fit.nonlin$fit)
We compare splines regression with linear regression using AIC. The non-linear model using spines produces the lower AIC compared to the standard linear form. This means that the splines method produced a more appropriate model for the highly correlated urine samples.
qc.fit.lin <- qgcomp(BMXBMIlog ~ URXMHBP + URXMOH + URXMHP + URXMHH + URXMCOH + URXMHNC + URXECP + URXHIBP + URXMIB,
expnms=chem_names_new,
q = 4,
dat = nhanes[,c(chem_names_new, 'BMXBMIlog')], family=gaussian())
AIC(qc.fit.lin)
## [1] -348.3578
AIC(qc.fit.nonlin)
## [1] -376.4991
It is often that datum may include missing entries, possibly causing concern for conclusions. Similar to linear regression, there are methods to handle these situations.
Primarily, a complete case analysis is often used for creating models. This means that all data entries have complete information, those with NA’s are removed entirely. Unlike other regression packages, qgcomp package requires the researcher to make their own dataset with the function complete.cases.
Additionally, missing data can become problematic when chemical exposures fall below the limit of detection. One method to correct this issue is to use very small values for the missing entries. As a general rule for qgcomp, if the proportion of the number of entries (below the limit of detection) are less than 1/(q), where q is the number of quantiles used, small values may be utilized.
Lastly, missing data can be handled by a method called multiple imputation. This procedure replaces entries without data with “specific values” so that all participants can be included in the study (those with missing data do not have to be excluded entirely). Similar to above, a new dataset can be created using multiple imputation. Suggested packages include mice, with the function mice.impute.leftcenslognorm for best interface results with qgcomp.
An additional feature of qgcomp
is the use of
time-to-event data analysis. This method helps predict whether and when
an event happened. Standard linear models are unable to compute this
metric, but it can be helpful in medical applications.
The qgcomp
package also has the capabilities to compute
models despite the data having covariance among observations (the error
term is not independent for individual observations). To compensate for
this issue, cluster-based sampling and bootstrapping can be used through
qgcomp.glm.boot
.
To read more about how to complete these steps and for additional applications of the qgcomp model, visit the vignettes.