Topics

  • Basic ideas of Bayesian Model Averaging

  • Application of Bayesian Model Averaging using R package BAS

  • Estimation, Interpretation and Prediction

  • Advantages of Bayesian Model Averaging over all-or-none model selection

Bayesian Model Averaging

Suppose we are interested in exploring the relationship between the chemical exposures and the outcome (BMI for example). These chemical exposures are usually highly correlated with each other (known as multicollinearity). One way is to summarize these highly correlated chemical mixtures and explore the potential influence of these chemical mixtures as a whole (See tutorials of weighted quantile sum regression, Bayesian weighted sum regression, quantile g-computation). However, the model require a strict assumption that all exposures should contribute to the same direction which is not applicable in some situations. Another popular way is to select a subset of these chemical mixtures as representatives, which is called variable selection, or model selection. Traditionally, analysis often proceeds by first selecting the best model according to some criterion and then learning about the parameters given that the selected model is the underlying truth. However, this approach has potential issues: (1) We cannot quantify model uncertainty through these all-or-none selection methods. (2) There are often many modeling choices that are secondary to the main questions of interest but can still have an important effect on conclusions.

As an alternative, Bayesian Model Averaging (BMA) carry a model combination idea. Instead of choosing only one model, it learns the parameters for all candidate models and then combine the estimates according to the uncertainty (posterior probabilities) of associated model. Specifically, this is done through a parameter estimate obtained by averaging the predictions of the different models under consideration, each weighted by its model probability.

Given quantity of interest \(\Delta\) (which can be a future observation \(Y^*\), or a parameter of interest \(\beta_j\)), then its posterior distribution of given data \(Y\) is

\[Pr(\Delta|Y)=\sum_{k=1}^KPr(\Delta|M_k,Y)Pr(M_k|Y).\] Let \(K\) denote the number of all potential models, the model probability of model \(k\) (\(Pr(M_k|Y)\)) is calculated by

\[Pr(M_k|Y) = \frac{Pr(Y|M_k)Pr(M_k)}{\sum_{l=1}^K Pr(Y|M_l)Pr(M_l)}\]

Bayesian Model Averaging combines predictions from multiple models by weighting them according to their posterior probabilities given observed data, a process facilitated by Bayes’ Theorem, while the Law of Total Probability ensures a comprehensive estimation by considering all possible model outcomes.

The model \(M\) can take various forms, such as linear models, generalized linear models, survival analysis. Here in this tutorial, we will only focus on BMA for linear regression.

There are three packages available: BAS, BMS, and BMA. A thorough comparison are presented in this paper. Here we use BAS package for its fast computation, flexible choices of priors, and various options for inference and diagnosis.

Libraries

The R package BAS provides ways of carrying out Bayesian Model Averaging (BMA) for linear regression, generalized linear models, and survival or event history analysis using Cox proportional hazards models. It contains functions for plotting the BMA posterior distributions of the model parameters, as well as an image plot function that provides a way of visualizing the BMA output. The functions bas.lm provide fast and automatic default ways of doing this for the model classes considered.

library(BAS)
library(tidyverse)

To illustrate how BMA takes account of model uncertainty about the variables to be included in linear regression, we will be using the nhanes dataset where the variables are described in the file nhanes-codebook.txt. Load this data with the load function and specify the data file.

load(file='nhanes1518.rda')

Exploratory Data Analysis

We focus on exploring the poential effect of URX predictors (i.e. the ones related to phthalates concentrations) on BMI. Here we first select URX predictors and then filter out NA values:

nhanes_URX<-nhanes1518%>%
  select(BMXBMI, URXUCR, URXCNP,URXCOP,URXECP,URXHIBP,URXMBP,URXMC1,URXMEP,URXMHBP,URXMHH)%>%
  na.omit() 

We start exploration by plotting a correlation matrix between the variables of interest. We can use the corrplot function within the corrplot package, and adjust the style to fit our aesthetics of desire. We see that the predictors are highly correlated, especially between URXMHH and URXECP, and URXMHBP and URXMBP.

library(corrplot)
corr_mat=cor(nhanes_URX,method="s")
corrplot(corr_mat, 
         addCoef.col = 1,    
         number.cex = 0.5) # Change font size of correlation coefficients

# other styles:
# default 
# corrplot(corr_mat) # circles
# squares, variables presented in an alphabet order
# corrplot(corr_mat, method = 'color', order = 'alphabet') # squares
# can choose different style for lower and upper triangle, can order the variable by clustering result
#corrplot.mixed(corr_mat, lower = 'shade', upper = 'pie', order = 'hclust')
df_long <- nhanes_URX %>% pivot_longer(everything())

# Plot histograms using ggplot
ggplot(df_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ name, scales = "free")

We also plot the distributions of our variables of interest involved in the model, and notice that the variables have a long tail in the original scale, which hints at log transformation which we will perform later on.

Bayesian Model Averaging for Linear Regression Models

As shown in the formula above, BMA addresses model uncertainty in a regression problem, and here we consider \(M_k\) to be a linear regression. Suppose a linear model structure, with \(y\) being the dependent variable, \(\alpha\), an intercept, \(\beta\) the coefficients, and \(\epsilon\) a normal IID error term with variance \(\sigma^2\): \[y=\alpha+X \beta+\epsilon \hspace{1cm}\epsilon \sim N(0, \sigma^2I),\] where \(y\) is a \(n\) by 1 vector \(y=(y_1,...,y_n)’\) where n is sample size

Since the X’s may be correlated, we may want to select a subset of all the \(p\) variables. All the combinations of potential models are \(2^p\), since for each variable, we could choose to either include it or not. We have \(p\) variables, therefore we have \(2^p\) possible models \((M_1, …, M_{2^p})\) To indicate whether we include the variable, we introduce an indicator \(\gamma_i\), where \(i = 1,...,p\). \(\gamma_i=1\) indicates we include the \(i^{th}\) variable in the model, while \(\gamma_i=1\) indicates we exclude the \(i^{th}\) variable from the model. Therefore, the model \(M_k\) becomes

\[y=\alpha_\gamma+X_\gamma \beta_\gamma+\epsilon \hspace{1cm}\epsilon \sim N(0, \sigma^2I),\] where \(\gamma=(\gamma_1,\dots,\gamma_p)\) is a \(p\) by 1 vector. \((\alpha_\gamma,X_\gamma,\beta_\gamma)\) indicates a subset of variables nested within \((\alpha,X,\beta)\) where variables are included when corresponding \(\gamma_i=1\).

The posterior probability of \(\gamma_i\) is calculated as the sum of the posterior probabilities of all models that include that model or variable. \[P(\gamma_i=1|Y)=\sum_{k=1}^KPr(\gamma|M_k,Y)Pr(M_k|Y).\]

We called it inclusion probability (\(\pi_i\)). It measures the importance of variable i.

We perform log transformation to all the variables since the EDA suggests the long tail and extreme values in the original scale. We fit a BMA model on the dataset using BMI as the response variable, and all of the 10 URX variables as predictor variables.

BAS uses a model formula similar to lm to specify the full model with all of the potential predictors. Here we are using the shorthand . to indicate that all remaining variables in the data frame provided by the data argument.

To get started, we will use BAS with the Zellner-Siow Cauchy prior on the coefficients. We will discuss the different choice of priors in later sections.

nhanes_URX<-log(nhanes_URX)
model <-bas.lm(BMXBMI ~ .,
  data = nhanes_URX,
  prior = "ZS-null",
  modelprior = uniform(), initprobs = "eplogp", method="MCMC",
  force.heredity = FALSE
  # The force.heredity = TRUE argument means that we force levels of a factor 
  #to enter or leave together
)

Here we specify the ZS-null as the prior argument, and the argument uniform() means that the prior distribution over the models is a uniform distribution which assigns equal probabilities to all models. The optional argument initprobs=eplogp provides a way to initialize the sampling algorithm and order the variables in the tree structure that represents the model space in BAS. The eplogp option uses the Bayes factor calibration of p-values \(-eplog(p)\) to provide an approximation to the marginal inclusion probability that the coefficient of each predictor is zero, using the p-values from the full model. The force.heredity = TRUE argument means that we force levels of a factor to enter or leave together.

summary(model)
##           P(B != 0 | Y)  model 1     model 2     model 3     model 4
## Intercept    1.00000000   1.0000   1.0000000   1.0000000   1.0000000
## URXUCR       0.99970703   1.0000   1.0000000   1.0000000   1.0000000
## URXCNP       0.99921875   1.0000   1.0000000   1.0000000   1.0000000
## URXCOP       0.61220703   1.0000   0.0000000   0.0000000   1.0000000
## URXECP       0.99970703   1.0000   1.0000000   1.0000000   1.0000000
## URXHIBP      0.99960938   1.0000   1.0000000   1.0000000   1.0000000
## URXMBP       0.94902344   1.0000   1.0000000   1.0000000   1.0000000
## URXMC1       0.06538086   0.0000   0.0000000   0.0000000   1.0000000
## URXMEP       0.99936523   1.0000   1.0000000   1.0000000   1.0000000
## URXMHBP      0.99990234   1.0000   1.0000000   1.0000000   1.0000000
## URXMHH       0.90410156   1.0000   1.0000000   0.0000000   1.0000000
## BF                   NA   1.0000   0.5747392   0.1126271   0.1144826
## PostProbs            NA   0.4893   0.3131000   0.0486000   0.0481000
## R2                   NA   0.2726   0.2716000   0.2703000   0.2729000
## dim                  NA  10.0000   9.0000000   8.0000000  11.0000000
## logmarg              NA 869.0414 868.4875542 866.8577204 866.8740607
##                model 5
## Intercept   1.00000000
## URXUCR      1.00000000
## URXCNP      1.00000000
## URXCOP      1.00000000
## URXECP      1.00000000
## URXHIBP     1.00000000
## URXMBP      1.00000000
## URXMC1      0.00000000
## URXMEP      1.00000000
## URXMHBP     1.00000000
## URXMHH      0.00000000
## BF          0.06046422
## PostProbs   0.03700000
## R2          0.27100000
## dim         9.00000000
## logmarg   866.23568961

summary() provides summary statistics and lists the top 5 models (in a descending order of posterior probability) with the zero-one indicators for variable inclusion. The first column presents posterior probability of inclusion (PIP) for each variable. Here we see that the URXUCR,URXCNP, URXECP,URXHIBP,URXMEP,URXMHBP are included in almost every single model (inclusion probability very close to 1), thus deemed to be important by the BMA model. URXMBP and URXMHH have pip greater than or roughly equal to 0.9, suggesting strong correlations with BMI. On the other hand, the inclusion probability of URXMC1 variable is only 0.07, therefore deemed to be a weak predictor of BMI by the BMA model. The other column represents a binary outcome of whether the model includes the variable or not. The other rows in the summary table are BF (Bayes Factor), PostProbs (Posterior Probability), R2(\(R^2\) statistics), dim (dimension of model), and log marginal likelihood under the model.

The Bayes factor row represents Bayes factor of each model to the highest probability model (hence its Bayes factor is 1). Bayes Factor is a measure of the relative evidence for two competing models, typically referred to as Model 1 (\(M_1\)) and Model 2 (\(M_2\)). It quantifies how much more or less likely the data are under one model compared to the other. If \(BF(M_1,M_2) > 1\): It suggests that Model 1 is more supported by the data than Model 2. If \(BF(M_1,M_2) < 1\): It suggests that Model 2 is more supported by the data than Model 1. We can make use of Bayes factor for model comparison.

Similarly, other measurements acts as good alternatives for model comparison. The posterior probabilities of models measures how likely the model assumption are supported by the data. The ordinary \(R^2\) of the models indicate how much variation of data are explained by the model. The dimension of the models captures the complexity of the model. The log marginal likelihood can also evaluate performance of the model.

The image function help to visualize the potential model candidates, which looks like a crossword puzzle:

image(model, rotate=F)

Here, the predictors, including the intercept, are on the y-axis, while the x-axis corresponds to each different model. The color indicates log posterior odds, representing how likely the corresponding model are supported by the data. And black color indicates the corresponding variable (row-wise) is not included in a particular model (column-wise). The colors show how well each model fits the data, with the color scale proportional to posterior probabilities. Orange signifies the best-fitting models, while dark violet indicates the least supported models among the top 20. Models with similar colors can be considered to have no significant difference in performance. We usually view the image by rows to check the importance of a variable. Here the plot indicates that most of the URX variables are significant in most models, except for URXMC1,URXCOP,URXMBP,URXMHH which are less likely included in most models.

We can also plot the posterior distributions of these coefficients to take a closer look at potential influence of each variable:

coef.model <- coef(model)
par(mfrow=c(2,2))
plot(coef.model, ask = F)

Notice the sticks located at 0 represent posterior inclusion probability. If not present, then the BMA model suggests that we should not include the variable. We observe that the posterior probability distributions of URXMC1 and URXCOP have a very large point mass at 0, while the distribution of URXUCR and URXMEP have relatively small mass at 0. There is a slighly little tip at 0 for the variable URXUCR, indicating that the posterior inclusion probability of URXECP is not exactly 1. This plot agrees with the summary table we obtained above.

confint provides the credible interval of coefficients, which can be visualized using plot.

confint(coef.model)
##                   2.5%       97.5%          beta
## Intercept  3.193504119  3.20744731  3.2004289902
## URXUCR     0.199831421  0.22564170  0.2127607122
## URXCNP    -0.043350711 -0.02080917 -0.0318914429
## URXCOP     0.000000000  0.01858995  0.0074521869
## URXECP    -0.106335816 -0.04809551 -0.0820363505
## URXHIBP   -0.073680962 -0.05317933 -0.0635287299
## URXMBP     0.000000000  0.04088854  0.0265956633
## URXMC1    -0.005517034  0.00000000 -0.0004686184
## URXMEP     0.020279614  0.03176388  0.0260114879
## URXMHBP   -0.092755428 -0.05711750 -0.0765324324
## URXMHH     0.000000000  0.05178882  0.0316474773
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
# set parm = 2:11 to exclude intercept in the visualization
plot(confint(coef.model,parm = 2:11),estimator = "HPM") 

## NULL

The output of confint() is a table, with columns 2.5%, 97.5% representing 2.5% and 97.5% quantiles of posterior distribution, column beta representing the posterior mean. We see that URXCOP does not have significant effect on BMI since the credible interval include 0. URXUCR and URXMEP have strong positive effects on BMI since the credible interval is entirely positive, whereas URXHIBP and URXCNP have negative effects on BMI since the credible interval is negative. HPM stands for the highest probability model. Other choices of estimator include MPM (the median probability model of Berbieri and Berger), BPM (best predictive model) and BMA (Bayesian model averaging).

Model Diagnostics

par(mfrow=c(2,2))
plot(model, ask = F)

The Residuals vs. Fitted Values plot helps to check for linearity and homoscedasticity. Ideally, residuals should be randomly scattered around the horizontal line at zero, indicating that the model captures the underlying linear relationship between the predictors and the response variable without systematic patterns. However, the residuals form an elliptical pattern, suggesting that the assumptions of linearity and homoscedasticity may be violated. Possible reasons for this include the omission of important factors in the model and the constraint that the chemical mixtures are bounded by 0.

The model probabilities plot provides cumulative probability of posterior probabilities of models. We can observe model 15 and model 16 have relative large posterior probabilities compared to other alternatives.

The model complexity plot provides log marginal likelihood versus model dimension. We observe an increase of log marginal likelihood as model dimension increases. However, when the model is complicated (dimension over 6), it does not bring significant improvement to the likelihood.

The inclusion probabilities plot provides a visualization on posterior inclusion probability(PIP) for each variable, with red color indicating PIP greater or equal to 0.5. We observe that URXMC1 is not important and excluded in most of models.

diagnostics(model, type = "pip", pch = 16) 

The convergence plot is a straight line, which indicates that the sampler has reached convergence, meaning that the sampled values adequately represent the posterior distribution of interest.

Model Predictions of BMA

BAS has methods defined to return fitted values, namely fitted, using the observed design matrix and predictions at either the observed data or potentially new values, predict, as with linear regressions. predict provides model fitting and prediction. fitted produce the fitted values, which is equivalent to predict()$fit

BMA_fitted <- fitted(model, estimator = "BMA")
# predict has additional slots for fitted values under BMA, predictions under each model
BMA <- predict(model, estimator = "BMA")

Using the se.fit = TRUE, we can also calculate standard deviations for prediction or for the mean and use this as input for the confint function for the prediction object. The mean BMI of a population with fairly typical exposure values will between 24.37 and 24.71. For an individual with fairly typical exposure values, his/her BMI on average will between 14.67 and 41.54.

# take the average of chemical exposures. 
avg_ind = as.data.frame(t(apply(nhanes_URX[,-1],2,mean)))

BMA_pred <- predict(model, avg_ind, estimator = "BMA", se.fit = TRUE)
# confidence interval
confint(BMA_pred, parm = "mean") |> exp() 
##          2.5%    97.5%     mean
## [1,] 24.37729 24.71728 24.54306
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
# prediction interval
confint(BMA_pred, parm = "pred") |> exp() 
##          2.5%    97.5%     pred
## [1,] 14.50688 41.51436 24.54306
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The package also offers visualizations for the intervals. When predicting individuals with minimal exposure (1st percentile), typical exposure (50th percentile), and high exposure (99th percentile), we provide visualizations for both the population-level confidence intervals and individual-level prediction intervals. We observe that, at the population level, individuals with typical exposure have higher BMIs compared to those with minimal or high exposure. However, at the individual level, the differences are quite small, as indicated by the overlapping prediction intervals.

newdat = as.data.frame(apply(nhanes_URX[,-1],2,quantile,c(0.01,0.5,0.99)))

bma_pred <- predict(model, newdat, estimator = "BMA", se.fit = TRUE)
model.fit <- confint(bma_pred, parm = "mean")
model.pred <- confint(bma_pred, parm = "pred")
## visualize confidence interval
vis_confidence = model.fit %>% exp()
class(vis_confidence) <- "confint.bas"
plot(vis_confidence)

## NULL
## visualize prediction interval
vis_pred = model.pred %>% exp()
class(vis_pred) <- "confint.bas"
plot(vis_pred)

## NULL

Model Predictions with Model Selection

In addition to using BMA, we can use the posterior means under model selection. This corresponds to a decision rule that combines estimation and selection. The BAS package currently offers three options: the highest probability model (HPM), the median probability model (MPM), and the best predictive model (BPM). The choice of model depends on the analysis goal. For interpretation purposes, we recommend using HPM. If the model includes many variables and the sample size is relatively small, to avoid overfitting and achieve robustness, we suggest using MPM. For prioritizing prediction performance, we recommend using BPM.

Highest Probability Model

Pros:

  • Simplicity: HPM is straightforward to understand and implement. It selects the model with the highest posterior probability.

  • Interpretable: The HPM is easy to interpret because it directly represents the most probable model.

Cons:

  • Overfitting: HPM can be prone to overfitting because it chooses the model that fits the data best, even if it’s overly complex.

  • Ignores model uncertainty: It doesn’t consider the uncertainty associated with other models, potentially leading to suboptimal decisions when multiple models are equally plausible.

HPM <- predict(model, estimator = "HPM")

# show the indices of variables in the best model where 0 is the intercept
HPM$bestmodel
##  [1]  0  1  2  3  4  5  6  8  9 10

Now we explore a little more interpretable version with names:

variable.names(HPM)
##  [1] "Intercept" "URXUCR"    "URXCNP"    "URXCOP"    "URXECP"    "URXHIBP"  
##  [7] "URXMBP"    "URXMEP"    "URXMHBP"   "URXMHH"

Median Probability Model

MPM <- predict(model, estimator = "MPM")
variable.names(MPM)
##  [1] "Intercept" "URXUCR"    "URXCNP"    "URXCOP"    "URXECP"    "URXHIBP"  
##  [7] "URXMBP"    "URXMEP"    "URXMHBP"   "URXMHH"

This is the model where all predictors have an inclusion probability greater than or equal to 0.5. This coincides with the HPM if the predictors are all mutually orthogonal, and in this case is the best predictive model under squared error loss.

Pros:

  • Reduces sensitivity to outliers: MPM is less affected by extreme probabilities, making it more robust in cases where a single model has an unusually high or low probability.

  • Balances between overfitting and underfitting: It often results in a model that strikes a balance between overfitting and underfitting, as it considers the entire distribution of model probabilities.

Cons:

  • Complexity: Finding the MPM involves calculating the median over the model probabilities, which can be computationally intensive or challenging for complex models.

  • May not align with predictive performance: While it reduces sensitivity to outliers, the MPM may not necessarily provide the best predictive performance.

Note that we can also extract the best model from the attribute in the fitted values as well.

Best Predictive Model

In general, the HPM or MPM are not the best predictive models, which from a Bayesian decision theory perspective would be the model that is closest to BMA predictions under squared error loss.

Pros:

  • Emphasizes predictive accuracy: BPM selects the model that gives the best predictive performance on average, making it a suitable choice when the ultimate goal is to make accurate predictions.

  • Accounts for model uncertainty: It considers the predictive performance of all models, providing a more robust and uncertainty-aware approach.

Cons:

  • Complexity: Implementing BPM can be computationally intensive, as it requires estimating predictive performance for each model in the set.

  • May not be as interpretable: The selected BPM might not be as easy to interpret as the HPM.

BPM <- predict(model, estimator = "BPM")
variable.names(BPM)
##  [1] "Intercept" "URXUCR"    "URXCNP"    "URXCOP"    "URXECP"    "URXHIBP"  
##  [7] "URXMBP"    "URXMEP"    "URXMHBP"   "URXMHH"

Prior Selection

Prior selection is a critical step in Bayesian model averaging (BMA) as it determines the weights assigned to each model in the model space. The choice of prior distribution for the model parameters can significantly impact the BMA results. A well-informed prior can help avoid overfitting, reduce uncertainty, and improve model selection. However, selecting an appropriate prior can be challenging as it requires balancing between being informative enough to guide the model towards plausible solutions and being uninformative enough to avoid biasing the results. Prior elicitation techniques such as expert opinion, empirical data, and sensitivity analysis can be employed to guide the choice of priors. Overall, careful prior selection is crucial for obtaining reliable and accurate BMA results.

If you lack strong prior knowledge, we recommend using the default prior of the package, the Zellner-Siow prior (ZS-null). This prior strikes a balance between model complexity and goodness of fit, typically resulting in a robust estimator.

Below, we list all the prior options available in the package, along with detailed information and comparisons for each.

  • “BIC”
  • “AIC
  • “g-prior”, Zellner’s g prior where ‘g’ is specified using the argument ‘alpha’
  • “hyper-g”, a mixture of g-priors where the prior on g/(1+g) is a Beta(1, alpha/2) as in Liang et al (2008). This uses the Cephes library for evaluation of the marginal likelihoods and may be numerically unstable for large n or R2 close to 1. Default choice of alpha is 3.
  • “hyper-g-laplace”, Same as above but using a Laplace approximation to integrate over the prior on g.
  • “hyper-g-n”, a mixture of g-priors that where u = g/n and u ~ Beta(1, alpha/2) to provide consistency when the null model is true.
  • “JZS” Jeffreys-Zellner-Siow prior which uses the Jeffreys prior on sigma and the Zellner-Siow Cauchy prior on the coefficients. The optional parameter ‘alpha’ can be used to control the squared scale of the prior, where the default is alpha=1. Setting ‘alpha’ is equal to rscale^2 in the BayesFactor package of Morey. This uses QUADMATH for numerical integration of g.
  • “ZS-null”, a Laplace approximation to the ‘JZS’ prior for integration of g. alpha = 1 only. We recommend using ‘JZS’ for accuracy and compatibility with the BayesFactor package, although it is slower.
  • “ZS-full” (to be deprecated)
  • “EB-local”, use the MLE of g from the marginal likelihood within each model
  • “EB-global” uses an EM algorithm to find a common or global estimate of g, averaged over all models. When it is not possible to enumerate all models, the EM algorithm uses only the models sampled under EB-local.

Here are some recommendations for different priors under different situations, as well as pros and cons for different priors:

BIC:

  • Use Case: BIC is often used for model selection in frequentist statistics. It penalizes model complexity and is appropriate when you want to balance model fit with model complexity.

  • Pros: It is simple to compute and tends to select more parsimonious models.

  • Cons: BIC may not perform well when the true model is not among the candidates, and it assumes that models are nested.

AIC:

  • Use Case: AIC is also used for model selection and penalizes model complexity. It is useful when you want to balance model fit with model complexity.

  • Pros: Like BIC, AIC is relatively simple to compute and can be used for a wide range of models.

  • Cons: AIC can favor more complex models when the sample size is large, and it may not always be suitable for small sample sizes.

g-Prior (Zellner’s g-prior):

  • Use Case: The g-prior is commonly used when you want to specify a prior on the regression coefficients that balances between an uninformative prior (large ‘g’) and a more informative prior (small ‘g’).

  • Pros: It allows you to control the strength of the prior information using the ‘alpha’ parameter. It provides a flexible way to incorporate prior beliefs.

  • Cons: The choice of ‘alpha’ can impact results, and selecting an appropriate value may require prior knowledge.

Jeffreys-Zellner-Siow (JZS) Prior:

  • Use Case: JZS prior combines the Jeffreys prior on the error variance and the Zellner-Siow Cauchy prior on regression coefficients. It’s useful when you want a default informative prior.

  • Pros: It provides a well-structured, informative prior that is robust in many situations.

  • Cons: It may not be suitable if you have strong prior knowledge that conflicts with the default prior structure.

Empirical Bayes (EB) Priors:

  • Use Case: EB priors are used when you want to estimate the hyperparameters (like ‘g’) from the data itself. EB-local estimates ‘g’ within each model, while EB-global estimates a common ‘g’ across all models.

  • Pros: EB priors can adapt to the data and provide a data-driven prior.

  • Cons: They may not perform well with very limited data, and the results can be sensitive to the choice of algorithm and prior distribution for hyperparameters.

Here we explore the model fitted under another prior, the g-prior to investigate how the prior selection affects the model output:

g_model <-bas.lm(BMXBMI ~ .,
  data = nhanes_URX,
  prior = "g-prior",
  modelprior = uniform(), initprobs = "eplogp", method="MCMC",
  force.heredity = FALSE, pivot = TRUE
)
summary(g_model)
##           P(B != 0 | Y)  model 1     model 2     model 3     model 4
## Intercept    1.00000000   1.0000   1.0000000   1.0000000   1.0000000
## URXUCR       1.00000000   1.0000   1.0000000   1.0000000   1.0000000
## URXCNP       0.99956055   1.0000   1.0000000   1.0000000   1.0000000
## URXCOP       0.37822266   0.0000   1.0000000   0.0000000   1.0000000
## URXECP       0.99951172   1.0000   1.0000000   1.0000000   1.0000000
## URXHIBP      0.99980469   1.0000   1.0000000   1.0000000   1.0000000
## URXMBP       0.90156250   1.0000   1.0000000   1.0000000   1.0000000
## URXMC1       0.02456055   0.0000   0.0000000   0.0000000   0.0000000
## URXMEP       0.99936523   1.0000   1.0000000   1.0000000   1.0000000
## URXMHBP      0.99975586   1.0000   1.0000000   1.0000000   1.0000000
## URXMHH       0.78217773   1.0000   1.0000000   0.0000000   0.0000000
## BF                   NA   1.0000   0.6896065   0.4666501   0.1048856
## PostProbs            NA   0.4038   0.2744000   0.1575000   0.0445000
## R2                   NA   0.2716   0.2726000   0.2703000   0.2710000
## dim                  NA   9.0000  10.0000000   8.0000000   9.0000000
## logmarg              NA 865.2906 864.9189775 864.5284360 863.0357263
##               model 5
## Intercept   1.0000000
## URXUCR      1.0000000
## URXCNP      1.0000000
## URXCOP      0.0000000
## URXECP      1.0000000
## URXHIBP     1.0000000
## URXMBP      0.0000000
## URXMC1      0.0000000
## URXMEP      1.0000000
## URXMHBP     1.0000000
## URXMHH      1.0000000
## BF          0.1102259
## PostProbs   0.0418000
## R2          0.2699000
## dim         8.0000000
## logmarg   863.0853882
coef.gmodel <- coef(g_model)
par(mfrow=c(2,2))
# plot(coef.model, mfrow=c(3,3), ask = F)
plot(coef.gmodel, ask = F)

Comparing this model output with the previous model output under the ZS-null prior, we notice that although the ZS-null prior model considers the URXECP predictor to be most important with an inclusion probability of 1.00, the g-prior model considers the URXHIBP predictor to be important with an inclusion probability of 0.9999. However, URXUCR, URXCNP, URXECP, URXHIBP, URXMBP and URXMEP were deemed important by both models. While the predictors have generally similar degrees of importance in both models, it is also worth noting that the posterior distributions of the URX variables have slightly different shapes under the two different priors.

Why Bayesian Model Averaging? Advantages

Model Uncertainty

  • It is important to take account of model uncertainty about statistical structure when making inferences. Oftentimes, there is remaining uncertainty not only about parameters, but also about the underlying true model. In this case, a Bayesian analysis allows one to take into account not only uncertainty about the parameters given a particular model, but also uncertainty across all models combined.

Simultaneous Scenarios

  • Allows users to incorporate several competing models in the estimation process. In theory, BMA provides better average predictive performance than any single model that could be selected. BMA avoids the all-or-nothing mentality that is associated with classical hypothesis testing, in which a model is either accepted or rejected wholesale. In contrast, BMA retains all model uncertainty until the final inference stage, which may or may not feature a discrete decision.

Model Misspecification

  • BMA is relatively robust to model misspecification. If one does select a single model, then one had better be sure of being correct. With BMA, a range of rival models contribute to estimates and predictions, and chances are that one of the models in the set is at least approximately correct.