brms
package using different priors.brms
package using different priors.This tutorial will make use of the following R libraries.
library(brms)
library(tidyverse)
library(bayesplot)
library(rstanarm)
library(broom)
library(ggplot2)
library(patchwork)
library(pander)
library(latex2exp)
This tutorial will be using the nhanes
dataset where the
variables are described in the file nhanes-codebook.txt
.
This data can be loaded with the load
function, specifying
the rda data file.
load(file = 'nhanes1518.rda')
There are two schools of thought in the world of statistics: frequentist, which we have been working with up to this point, and Bayesian. In the frequentist school, probability is defined as a frequency. For example, if we roll a fair six-sided die, the probability we will roll a 1 is \(P(Y = 1) = \frac{1}{6}\). The conclusions we make in frequentist statistics are based on likelihood; we retrieve data and then make an inference based on the data.
In the Bayesian school, probability is defined as a belief. For example, a Bayesian statement could be “I believe the probability that it rains tomorrow is 30%.” Different people obviously have different beliefs, so another statistician stating “I believe the probability that it rains tomorrow is 40%” could be just as valid. Like frequentist statistics, our conclusions rely on likelihood. However, unlike frequentist statistics, our conclusions combine the likelihood with another factor: our prior knowledge. When we collect data, we then combine our prior knowledge and our likelihood to update our prior knowledge. For simplicity, we can simply refer to our prior knowledge that we wish to update as the prior. After we see our data, we combine our likelihood with our prior to create the posterior.
It is important to understand that different statisticians can have different priors for the same data. For example, a statistician who lives in the desert may believe that the probability that it rains tomorrow is 5%. However, if they move to the rainforest, their belief on future rain chances will change dramatically. This example analogizes how we combine our prior knowledge (living in the desert leads to a low chance for rain) with our data (the amount of rain after we move to the rainforest) to obtain our posterior (living in the rainforest leads to a high chance for rain).
A statistician can also have no prior knowledge. Suppose, unrealistically, that a statistician has no idea how likely it is to rain on any given day. When that statistician is asked for the probability that it will rain tomorrow, they have no beliefs to go off of. This statistician can then use an uninformative prior, or one that tells us little about the situation at hand, to describe rain chances. A common example of an uninformative prior is a \(Uniform(0, 1)\) distribution, which would mean the probability that it rains tomorrow has an equal chance of being any value from 0 to 1.
But how exactly do we get from the prior and the likelihood to the posterior?
The fundamental concept of Bayesian statistics is the fittingly-named Bayes’ rule. Given a likelihood function \(p(y | \theta)\) and a prior \(p(\theta)\), the posterior \(p(\theta | y)\) stated by Bayes’ rule is \(p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)}\). \(p(y)\) is the marginal likelihood, which we can get by integrating out the \(\theta\) from the likelihood function:
\[ p(y) = \int p(y|\theta)p(\theta)d\theta \]
For example, we can use the Bayes’ rule with our nhanes
dataset to estimate the mean weight (BMXWT
) of the
population. We may assume that the weights in our data are independent
and identically distributed from a Normal distribution \(y_i | \mu\sim N(\mu, \sigma^2)\). We also
may have some prior knowledge of the mean weight \(\mu\); let’s say our belief is that \(\mu\) also follows a normal distribution,
\(\mu ~ \sim N(\mu_0, \sigma_0^2)\). By
plugging in our likelihood function \(p(y|\mu)\) and our prior \(p(\mu)\) and doing some algebra and
calculus, we would obtain a posterior \(p(\mu|y)\):
\[ \mu | y \sim N\left(\frac{\sigma_0^2}{\frac{\sigma^2}{n}+\sigma_0^2} \space \bar{y} \space + \frac{\sigma^2}{\frac{\sigma^2}{n}+\sigma_0^2} \space \mu_0 \space, \space \left(\frac{1}{\sigma_0^2} \space + \space \frac{n}{\sigma^2}\right)^{-1}\right) \]
We can see that the posterior mean (the first parameter of the normal distribution above) is a weighted sum of the prior mean \(\mu_0\) and the sample mean \(\bar{y}\). This helps illustrate how the Bayes’ rule combines the information from our data and from our prior in order to form the posterior.
This is a rather simple example, since the function for \(p(y)\) ends up being a closed-form expression and we are working with common distributions. Particularly, this example makes use of a conjugate prior: this occurs when the prior and posterior all follow the same probability distribution. In this case, the prior and likelihood both came from a Normal distribution, and the posterior also ended up being normally distributed. Statisticians often prefer using conjugate priors as it makes computation much simpler. However, conjugate priors may not be available or well-suited for every scenario. Under other priors, we may not be able to obtain a closed-form expression for \(p(y)\). In these situations where performing calculations proves to be extremely complex, we will not be able to apply the Bayes’ rule in order to retrieve the posterior. We would then have to use other sampling schemes–for example, Markov chain Monte Carlo, which will be touched on later–in order to obtain the samples that follow the posterior distribution.
We have now discussed both the prior and the posterior, which correspond to the probability distributions of parameter \(\theta\) before and after we obtain the data respectively. In Bayesian statistics, \(\theta\) is a random variable or a quantity that can take on different values based on different outcomes of an experiment. Hence, \(\theta\) comes from a probability distribution. This is a key distinction from frequentist statistics, where \(\theta\) is a fixed value.
The different perspectives that frequentist and Bayesian statisticians have for what \(\theta\) is correspond to different methods for inference. For example, frequentists may obtain the maximum likelihood estimator for \(\theta\) and construct the confidence interval around this estimate. A limitation of this approach is that the probability is defined for the procedure of constructing the confidence interval, but it is not defined for a specific fixed sample. Thus, when interpreting a 95% confidence interval for example, we cannot say “there is a 95% chance that the confidence interval includes the true population parameter.” Instead, we say that if we drew many samples from the population and created a 95% confidence interval for each sample, the true population parameter will be found in 95% of the intervals.
In Bayesian statistics, we obtain the posterior probability distribution of \(\theta\) by applying Bayes’ rule to update our prior belief. We use the posterior distribution to make inference. For example, Bayesian statisticians usually use the mean of the posterior probability distribution as a point estimator. Then, they use quantiles of the posterior distribution to construct a credible interval, the Bayesian answer to a confidence interval. These have simpler interpretations.
To illustrate credible intervals, let’s assume that the posterior distribution of \(\theta\) is a \(Uniform(0, 1)\) distribution. This means \(\theta\) has an equal chance of being any value between 0 and 1. A 95% credible interval is between any lower bound \(L\) and any upper bound \(U\) such that the posterior probability of \(L < p < U = 0.95\). We use the quantiles of our posterior distribution for \(\theta\) to obtain \(L\) and \(U\). In the case of a 95% interval, these are the 0.025 and 0.975 quantiles (the middle 95% of the data). For a \(Uniform(0, 1)\) distribution, this means \(L = 0.025\) and \(U = 0.975\). We can then say: “There is a 95% probability that the true population parameter lies between 0.025 and 0.975, given the evidence provided by the observed data.”
In statistics, we use linear regression to model the relationship between a response variable and one or more explanatory variables. Linear regression is done differently in frequentist statistics and Bayesian statistics.
In simple linear regression, we have one explanatory variable. Simple linear regression takes the form \(y = \beta_0 +\beta_1 x +\epsilon\), where \(y\) is the response variable, \(x\) is the explanatory variable, \(\beta_1\) is the true slope of the relationship between \(x\) and \(y\), \(\beta_0\) is the true intercept of the relationship between \(x\) and \(y\), and \(\epsilon\) is the random error term.
In frequentists’ statistics, linear regression uses point estimates for parameters \(\beta_0\) and \(\beta_1\) in order to help predict the value of the response variable based on the explanatory variable. The formula for Bayesian regression may appear to be the same, but now, our parameters are drawn from probability distributions. A vague prior suggests we have little prior knowledge on coefficients. As a result, the regression lines will spread out the space. In contrast, if we have strong prior on parameters, the line will be concentrated on a smaller space. Here we observe from the visualization, the prior is quite vague, after we see the data and update our prior, we will begin to update the prior with the data to obtain posterior which is more concentrated within a small range.
The assumptions for Bayesian linear regression are the same for the frequentist method, being as follows:
It is important that these assumptions are not violated to preserve the validity of the regression model. Later, we will learn how to verify that our model fits these assumptions.
As discussed earlier, we perform simple linear regression when we
want to explore the relationship between one explanatory variable and
one response variable. For example, we might want to better understand
the relationship between body mass index and age for a variety of
reasons–perhaps we want to know if different age groups are more prone
to obesity. We can help answer this question with a simple linear
regression model using the nhanes
dataset.
For this linear model, we want to predict body mass index based on
age; therefore, we will use BMXBMI
as our response variable
and AGE_C
as our explanatory variable. We will limit our
analysis to adults. We center the age at 18 years to facilitate the
interpretation of intercept as the mean BMI value for 18 year olds.
nhanes1518 <- nhanes1518 %>%
filter(RIDAGEYR >= 18) %>%
mutate(AGE_C = RIDAGEYR - 18)
We will use the brms
library to fit our Bayesian model.
An important part of creating a Bayesian regression model is
establishing our priors for the parameters. Based on our prior knowledge
of the data, there is a variety of priors we can set (and thus a variety
of models we can create). Later, we will discuss strategies for
selecting model priors. For now, we will assume that we have no prior
knowledge of the dataset, using a noninformative prior. Notice the
default prior in brms
is heavy-tail t-distribution with a
degree of freedom df = 3, mean and sd adjusted based on data. From the
plot below, you can observe t-distribution has heavier tail compared to
a normal distribution. As a result, a t-distribution prior is less
sensitive to extreme values and thus the resulting model is more
robust.
We can use the brm
function to specify our model. The
formula
argument of this function follows the format
y ~ x
, with y
being the response variable and
x
being the explanatory variable. We will save our model in
a variable named model1
. Since Bayesian model fitting
relies on random sampling, we specified the seed here to make sure it is
reproducible. Here we just picked the seed at convenience. Please refer
to this R-blog
for picking a more scientific random seed.
model1 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518,
seed = 123)
Once we have created our model, we will then interpret its results.
Our model gives us posterior distributions for \(\beta_{Intercept}\) and \(\beta_{Age}\); we can plot these with the
mcmc_dens
function of the bayesplot
package,
specifying the distributions we want to plot with the pars
argument.
mcmc_dens(model1, pars = c("b_Intercept", "b_AGE_C"))
We can also use the summary
function to view detailed
statistics of full posteriors. We will set the priors
argument to TRUE
so that the summary includes the prior
distributions we used for our model. Here the default priors for
coefficients and variance of error term are set to be t-distributions by
default.
summary(model1, priors = TRUE)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: BMXBMI ~ AGE_C
## Data: nhanes1518 (Number of observations: 11096)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Priors:
## Intercept ~ student_t(3, 28.4, 6.4)
## <lower=0> sigma ~ student_t(3, 0, 6.4)
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 28.98 0.13 28.73 29.25 1.00 4417 3603
## AGE_C 0.02 0.00 0.01 0.03 1.00 5296 3190
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 7.26 0.05 7.17 7.36 1.00 3746 2944
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The first three lines specify the data model \(y_i|\mu,\sigma \sim N(\beta_{int}+\beta_{age}
age_i,\sigma^2)\). The following two lines specify the dataset
and settings of MCMC. We ran four chains, each chains have 2000
iterations with the first 1000 as warmup (discarding in the final
analysis). Notice here we set iter = 2000
is only for
illustration. In practice, we usually set iteration to be at least 10000
to ensure the convergence of MCMC.
Suggest the prior of intercept is t-distribution with mean 28.4, sd
6.4, and df 3. The prior of \(\sigma\)
is half t-distribution with mean 0, sd 6.4, and df 3. These are the
default priors specified by brms
package since we do not
specify any prior in our code. The mean 28.4 and sd 6.4 are estimated
from data.
Notice the default prior for coefficients \(\beta_{age}\) is flat prior, which is a
constant on the real line. \(p(\beta)\propto
1, \beta \in (-\infty,\infty)\). If you wish to obtain the full
priors for all the parameters, use get_prior(model1)
.
This provides summary statistics of posterior distribution of \(\beta_{int}\) and \(\beta_{age}\). Estimate
provides posterior mean, and l-95\% CI
,
u-95\% CI
provide 0.025- and 0.975- quantiles.
This provides summary statistics of posterior distribution of \(\sigma\). Similarly Estimate
provides posterior mean, and l-95\% CI
,
u-95\% CI
provide 0.025- and 0.975- quantiles.
Similar to frequentists, we have point estimates for \(\beta_{Intercept}\) and \(\beta_{Age}\) that we are interested in.
These are the means of the posterior distributions we plotted with the
mcmc_dens
function. The estimates can be found in the
Intercept
and BMXBMI
rows of the
summary
output. These tell us that point estimate for the
intercept is 28.98, and the point estimate for our slope is 0.02. With
these statistics, let’s now fill in our model:
\[\hat{BMI} = 28.98 + 0.02(Age)\]
We can interpret the estimate of the slope to mean that we expect those who are 1 year older to have BMI 0.02 kg/m^2 higher, on average. For a person at age 18, his/her BMI is expected to be 28.98.
We are also interested in the l-95% CI
and
u-95% CI
columns of the summary
output. These
estimates are obtained from the credible intervals of the posterior
distributions we plotted with mcmc_dens
. We can thus state
that there is a 95% chance that the average BMI for an individual at age
18 lies between 28.28 and 29.04. We have high confidence in the result
that BMI tends to be higher for those who are older .
Now that we have a credible interval for our slope, we might want to
conclude whether there is a statistically significant relationship
between BMI and age. To do so, we can examine whether or not \(\hat{\beta_{Age}}\) is statistically
different from 0. Our null hypothesis in this case is
that the slope is equal to 0, while our alternative
hypothesis is that the slope is not equal to 0 (which implies a
linear relationship between BMI and age). We can answer this question
with our credible interval. Since the interval for AGE_C
does not include 0, we can reject our null hypothesis and conclude that
there is likely a linear relationship between age and body mass index
(although it appears to be very small).
Suppose we want to guess an individual’s BMI using our model. We can do this with a prediction interval; it differs from using normal confidence intervals as confidence intervals only express sampling uncertainty from our data. On the other hand, prediction intervals express sampling uncertainty from our data and from the single value with which we want to predict. As a result, prediction intervals are wider than confidence intervals.
We can use the predict
function with our model to
predict individual’s BMIs given an age of 18, 30, and 60.
set.seed(123)
predict(model1,
newdata = data.frame(AGE_C = c(18, 30, 60)-18),
interval = "prediction",
level = 0.95) %>%
knitr::kable(digits = 2)
Estimate | Est.Error | Q2.5 | Q97.5 |
---|---|---|---|
29.04 | 7.22 | 14.65 | 43.32 |
29.15 | 7.24 | 14.82 | 42.94 |
29.77 | 7.30 | 15.29 | 44.08 |
Our prediction interval estimates that a 18-year-old individual likely has a BMI between 14.765 and 43.32. As we can see, the interval is extremely wide, limiting our ability to make meaningful conclusions. We will later see in section Posterior Predictive Check that a transformation might this situation.
Now that we’ve tackled simple linear regression, let’s learn how to
do multivariate linear regression the Bayesian way. We
do multivariate regression when we want to know how one variable is
impacted by multiple other variables. We can also use this method when
we want to control for some variable. For example, in our
nhanes
dataset, we might want to know how factors like age,
waist size, weight, and height impact BMI; we might also want to know
how waist size, weight, and height impact BMI if we control for age. To
help answer these questions, we can fit a model with BMXBMI
as the response and AGE_C
, BMXWAIST
as the
predictors.
Let’s first visualize the distribution of all of our predictor variables in the dataset. Notice the age is top-coded, which means age above 80 (62 after centering at 18) years are censored.
p1 <- ggplot(nhanes1518, aes(x = AGE_C)) +
geom_histogram(color = "black", fill = "red") +
labs(title = 'Distribution of Age in Dataset')
p2 <- ggplot(nhanes1518, aes(x = BMXWAIST)) +
geom_histogram(color = "black", fill = "blue") +
labs(title = 'Distribution of Waist Size in Dataset')
p1 + p2 +
plot_layout(ncol = 2)
As we can see from the x-axis labels, our predictors are not measured
in the same way; age is in years, waist is in centimeters. This leads to
these predictors having different ranges and, therefore, different
variances. If we used this data as is, we would have to specify
different variances for the prior distributions of each predictor. For
simplicity purposes, we can use the scale
function in R to
normalize our data so that each predictor has a mean of 0 and a standard
deviation of 1. This will allow us to specify the same prior for each of
our predictor variables. Note that, in this example, scaling our data is
for convenience; in real-world scenarios, this may not be the best
approach. It’s important to mention that scaling isn’t required when
utilizing the default prior in the package. The package automatically
adjusts the prior according to the data.
nhanes1518['AGE_C_sc'] <- scale(nhanes1518$AGE_C)
nhanes1518['BMXWAIST_sc'] <- scale(nhanes1518$BMXWAIST)
Again, we will begin by assuming we have no prior knowledge of the
data and use the default the prior in brms
. We will again
use the brm
function to create our model. The format for
the formula
argument here is
y ~ x1 + x2 + .. + xn
, where each \(x_i\) is an explanatory variable.
multi_model1 <- brm(formula = BMXBMI ~ AGE_C_sc + BMXWAIST_sc,
data = nhanes1518,
seed = 123)
We can once again interpret our model’s results with the
summary
function.
summary(multi_model1, priors = TRUE)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: BMXBMI ~ AGE_C_sc + BMXWAIST_sc
## Data: nhanes1518 (Number of observations: 10534)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Priors:
## Intercept ~ student_t(3, 28.3, 6.4)
## <lower=0> sigma ~ student_t(3, 0, 6.4)
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 29.41 0.03 29.36 29.47 1.00 3805 2727
## AGE_C_sc -1.12 0.03 -1.18 -1.07 1.00 4210 2959
## BMXWAIST_sc 6.67 0.03 6.62 6.73 1.00 3808 3128
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 2.77 0.02 2.73 2.80 1.00 5729 3542
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
With these statistics, we can now fill in our model:
\[\hat{BMI} = 29.41 -1.12Age\_sc + 6.67Waist\_sc\]
We can interpret the estimate of the slope of age to mean that for every one standard deviation increase in scaled age, we expect BMI to decrease by 1.12 on average, holding all other factors constant. Since we scaled our data, we also have a meaningful interpretation for our intercept: we expect a person with the average age, waist size to have a BMI of 29.41.
We can again look at the l-95% CI
and
u-95% CI
columns to obtain 95% credible intervals for our
intercept and estimates. Since all of the intervals do not include 0, we
can reject our null hypothesis for all variables and conclude that,
holding all other factors constant, there is likely a linear
relationship between age and BMI, and waist size and BMI.
So far in this tutorial, we have used the default priors for our linear regression models. However, one of the foundations of Bayesian statistics is setting an informative prior, or one that tells us something about the data.
Prior selection is especially important when we do not have a lot of data. In these cases, since there is not much data to go off of, much of our inference comes from the prior. This means it is vital that we select a fitting distribution, or we risk an inaccurate posterior.
Different priors have different motivations and properties. Some scrupulous scientists want to be as objective as possible when choosing a prior, while others may aim to choose a prior that satisfies experts with different opinions. This section will discuss common priors used for coefficients (\(\beta\)) and variances (\(\sigma^2\)).
Since we have thousands of observations in our dataset, the different
priors we may set would yield similar results. To better demonstrate how
choosing different priors can impact our model, let’s take a sample of
our dataset of size n = 30
.
set.seed(456)
nhanes1518_sample <- sample_n(nhanes1518, 30)
Now, using our sampled data, let’s see the results of a model using default priors.
default_prior <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
seed = 123)
summary(default_prior, priors = TRUE)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: BMXBMI ~ AGE_C
## Data: nhanes1518_sample (Number of observations: 26)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Priors:
## Intercept ~ student_t(3, 27.5, 5.5)
## <lower=0> sigma ~ student_t(3, 0, 5.5)
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 34.03 3.81 26.44 41.49 1.00 3345 2575
## AGE_C -0.11 0.09 -0.30 0.08 1.00 3624 2446
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 9.43 1.37 7.20 12.58 1.00 2973 2644
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
We get the model \(\hat{BMI} = 34.03 - 0.11(Age)\).
We can use prior_summary
to obtain the default prior. We
can see the default prior for intercept and sigma (sd of error term) is
t-distribution, and default prior for slope is a flat prior (\(f(\beta) \propto 1, \beta \in
(-\infty,\infty)\)). We will compare this to models using
specified priors.
prior_summary(default_prior)
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b AGE_C
## student_t(3, 27.5, 5.5) Intercept
## student_t(3, 0, 5.5) sigma 0
## source
## default
## (vectorized)
## default
## default
A common prior for the coefficient (or \(\beta\)) in Bayesian inference deals with a normal (Gaussian) distribution. This is a conjugate prior, meaning that the posterior will also be a normal distribution, but with different parameters. This might be a good choice for our beta prior if we assume our data is normally distributed, meaning most values are situated in the middle, and there is a symmetrical decline in the frequency of values as we reach the ends.
We will use the same BMXBMI ~ AGE_C
model we worked with
in the single linear regression section. Let’s assume the BMI values in
our sample are normally distributed. We will choose a normal
distribution with \(\mu = 0\) and \(\sigma = 0.3\) for \(\beta_{Age}\). We can use the
prior
argument in the brm
function to set our
prior for \(\beta\). In this argument,
we set class = "b"
and coef = "AGE_C"
to
signify that this is a prior on \(\beta_{Age}\). Notice that the
parametrization in the function takes the form “normal(mean,sd)”.
model2 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("normal(0, 0.3)", class = "b", coef = "AGE_C")),
seed = 123)
Now, let’s use the mcmc_intervals
function of the
bayesplot
package to compare the values of \(\beta_{Age}\) for each of our two models so
far:
(mcmc_intervals(
default_prior,
point_est = "mean",
prob_outer = 0.95,
pars = c("b_AGE_C")) +
coord_cartesian(xlim = c(-0.35, 0.1)) +
labs(title = "Default Prior (Flat Prior)") +
theme_minimal()
) /
(mcmc_intervals(
model2,
point_est = "mean",
pars = c("b_AGE_C")) +
coord_cartesian(xlim = c(-0.35, 0.1)) +
labs(title = TeX("Normal(0, $0.3^2$) Coefficient Prior")) +
theme_minimal()
)
The thick line represents a 50% credible interval, while the thin line represents a 95% credible interval. As we can see, both models have similar point estimates for the mean of \(\beta_{Age}\), but the intervals differ in length. Our \(N(0, 0.3^2)\) model has a narrower interval, likely because we supplied a prior with a small variance. This signifies how different priors can yield different posteriors.
Let’s run another model with a normally distributed \(\beta\) prior, using the same mean but inputting a much larger variance.
model3 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("normal(0, 100)", class = "b", coef = "AGE_C")),
seed = 123)
mcmc_intervals(
model3,
point_est = "mean",
prob_outer = 0.95,
pars = c("b_AGE_C")) +
coord_cartesian(xlim = c(-0.35, 0.1)) +
labs(title = TeX("Normal(0, $100^2$) Coefficient Prior")) +
theme_minimal()
The results of this model more closely resemble our default model as compared to our model with the \(N(0, 0.3^2)\) prior. This is because our extremely large variance means we are very unsure about the value of \(\beta_{Age}\); thus, this prior is largely noninformative, similar to the default.
Let’s recap the models we’ve created so far with their results:
Another common prior for the coefficient in Bayesian inference deals with the student t-distribution. This differs from the normal distribution in that the values are less clustered in the middle, giving it heavier tails on the ends. Using the t-distribution can offer more robust estimation; it is often used on smaller, approximately normal datasets (for example, when n < 30). For the purpose of this section, we will proceed using a t-distribution to estimate \(\beta_{Age}\).
We can change the shape of our t-distribution by altering the degrees of freedom; the more degrees of freedom we have, the more we will resemble a normal distribution (so, the less heavy the tail ends will be). Let’s fit a model with a beta prior distributed as a unit t-distribution with 3 degrees of freedom:
model4 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("student_t(3, 0, 1)", class = "b", coef = "AGE_C")),
seed = 123)
mcmc_intervals(
model4,
point_est = "mean",
prob_outer = 0.95,
pars = c("b_AGE_C")) +
coord_cartesian(xlim = c(-0.35, 0.1)) +
labs(title = TeX("Student_T(3, 0, 1) Coefficient Prior")) +
theme_minimal()
Again, our model results resemble those we have previously calculated. For demonstrative purposes, let’s also run a model with more degrees of freedom (a more Gaussian-like distribution).
model5 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("student_t(30, 0, 1)", class = "b", coef = "AGE_C")),
seed = 123)
mcmc_intervals(
model5,
point_est = "mean",
prob_outer = 0.95,
pars = c("b_AGE_C")) +
coord_cartesian(xlim = c(-0.35, 0.1)) +
labs(title = TeX("Student_T(30, 0, 1) Coefficient Prior")) +
theme_minimal()
And now let’s once again recap our models:
Upon selecting our priors and running our models, we can then use our
posterior information to conduct further regression. Usually, regardless
of the initial choice of prior, results will end up converging to the
same posterior after many iterations. In this case, since our sample
size is now very small (n = 30
), it might be more sensible
to start with a student-t distribution for our \(\beta_{Age}\) parameter, since
t-distribution has a heavier tail, is more robust.
Now let’s move on to our variance (or \(\sigma^2\)) parameter. A common prior for the variance in Bayesian inference deals with the inverse gamma distribution. This particular distribution restricts the variance to only positive values (we cannot have a negative variance as it is the value of the standard deviation squared). The inverse gamma distribution takes two parameters, \(\alpha\) and \(\beta\), which impact the shape (different permutations of these parameters are pictured above). It is also a conjugate distribution, so our posterior will be an inverse gamma distribution.
Let’s run two models with inverse-gamma variance priors, using the
parameters of the light blue line and the red line in the visualization
above. We will use our nhanes1518_sample
dataset that we
created in the previous section.
model6 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("inv_gamma(3, 0.5)", class = "sigma")),
seed = 123)
model7 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("inv_gamma(1, 1)", class = "sigma")),
seed = 123)
Another common prior for the variance in Bayesian inference deals with the Half-Cauchy distribution. The Cauchy distribution is equivalent to a student t-distribution with 1 degree of freedom. We then restrict this to only contain positive values (as again, our variance cannot be negative) to get the Half-Cauchy distribution. The Half-Cauchy takes two parameters, \(\mu\) and \(\sigma\), which impact the shape (two permutations of these parameters are pictured above). Similar to the student-t distribution, the Half-Cauchy has a heavy tail and provides more robust estimation.
Let’s run two models with Half-Cauchy variance priors, using the
parameters of the two examples pictured above. We will continue to use
our nhanes1518_sample
dataset. If we specify our
sigma
prior to be a cauchy
, the
brms
package will automatically restrict it to positive
values, thus effectively making it a Half-Cauchy.
model8 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("cauchy(0, 0.1)", class = "sigma")),
seed = 123)
model9 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
prior = c(set_prior("cauchy(0, 1)", class = "sigma")),
seed = 123)
Let’s visualize the results from all of our variance models with the
mcmc_intervals
function
Notice specifically how the model with the \(InvGam(3, 0.5)\) prior for the variance sticks out from the rest in terms of posterior \(\sigma\). This is likely because of the prior’s large \(\alpha\) parameter, which forces the prior to very sharply fit the data. Meanwhile, the model with the \(Half-Cauchy(0, 1)\) prior for the variance has the widest credible interval. This is likely because of the prior’s large \(\sigma\) parameter, which corresponds to the prior weakly fitting the data.
When we construct a multivariate regression model, we have more
parameters for which we can set a prior. For this section, we will
construct the same model as in the section on multivariate regression,
with BMI as the response and age, waist size as the predictors. However,
we will now use the nhanes1518_sample
dataset. We first
need to scale our variables.
nhanes1518_sample['AGE_C_sc'] <- scale(nhanes1518_sample$AGE_C)
nhanes1518_sample['BMXWAIST_sc'] <- scale(nhanes1518_sample$BMXWAIST)
Now, let’s create a model with the default parameters, which we will compare to a model with specified priors. The default priors are: flat priors for coefficients, t(3, 27.3, 5.6) for intercept, and half-t(3, 0, 5.6) for sigma.
default_multi <- brm(formula = BMXBMI ~ AGE_C_sc + BMXWAIST_sc ,
data = nhanes1518_sample,
seed = 123)
get_prior(formula = BMXBMI ~ AGE_C_sc + BMXWAIST_sc ,
data = nhanes1518_sample,
seed = 123)
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b AGE_C_sc
## (flat) b BMXWAIST_sc
## student_t(3, 27.3, 5.6) Intercept
## student_t(3, 0, 5.6) sigma 0
## source
## default
## (vectorized)
## (vectorized)
## default
## default
In brms
, we can set the prior for each coefficient
separately. For this model, we will use a normal distribution prior for
each \(\beta\) with a mean of 0. Since
we don’t know much about the data, we want our prior to be relatively
flat; thus we will use a relatively large variance of 100. We will also
use a simple \(HalfCauchy(0, 1)\) prior
for sigma, and a relative flat \(N(0,
22)\) prior for the intercept.
After we create our model, we will compare the results of the two
models with mcmc_intervals
.
priors2 <- c(set_prior("normal(0, 10)", class = "b", coef = 'AGE_C_sc'),
set_prior("normal(0, 10)", class = "b", coef = 'BMXWAIST_sc'),
set_prior("cauchy(1, 1)", class = "sigma"),
set_prior("normal(0, 22)", class = "Intercept"))
multi_model2 <- brm(formula = BMXBMI ~ AGE_C_sc + BMXWAIST_sc ,
data = nhanes1518_sample,
prior = priors2,
seed = 123)
As we can see, the results of both models are very similar. This makes sense, as all of our specified priors in the second model were very unspecific (due to large variances). If we fit sharper priors, we could expect to see a larger difference in the posterior from the default model.
Prior selection is a pivotal step of the Bayesian process. An informative prior can help make up for a lack of available data, allowing us to still obtain a posterior from which we can make meaningful interpretations and inferences.
This section goes over two common priors for both \(\beta\) and \(\sigma^2\) values. Using a normal distribution for \(\beta\) and an inverse-gamma distribution for \(\sigma^2\) can be a good choice if we believe most of our data is not too far from the mean. However, Bayesian statisticians tend to use a t-distribution for \(\beta\) and a Half-Cauchy distribution for \(\sigma^2\) as these are insensitive to outliers. Regardless, there is no predefined “best answer” on what prior to use in any given situation; it depends on our knowledge of the information.
There also exists a special diagnostic for Bayesian models: MCMC diagnostics. Most of the time, we usually cannot determine the exact posterior distribution from our model, but we can use MCMC sampling to draw samples from the posterior distribution. Therefore, we need to check MCMC diagnostics to ensure that the MCMC samples we obtain can accurately represent the true posterior distribution.
In order to see if our Bayesian analysis is working and performing well, we need to examine Markov chain Monte Carlo (MCMC) diagnostics. In statistics, the Monte Carlo method refers to when we generate a sample of independent draws from the probability distribution of a random variable \(X\) in order to estimate a feature of that distribution. The Markov chain Monte Carlo method is similar, but in this case the items in the generated sample are serially correlated rather than independent.
During model evaluation, samples from our posterior distributions
will be generated using the MCMC method; we then can use trace
plots to check convergence of our samples. Trace plots are line
charts showing time on the x-axis, and values from the sample draws on
the y-axis. We can use the plot()
function in the
brms
package to display trace plots.
Let’s first examine a model that passes MCMC diagnostics. We will
continue to use the nhanes1518_sample
dataset, and we will
focus on simple linear regression models.
mcmc1 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
seed = 123)
plot(mcmc1)
The plots on the right side are the trace plots. We want to see
something that represents a fuzzy caterpillar; this means that the
generated draws have low serial correlation and that the sample space is
well explored. All three of our trace plots for this model fit the
criteria. We can also do summary(mcmc1)
and look at the
values in the Rhat
column; a value of 1 means our Markov
chains have converged sufficiently, which is what we are looking
for.
Now, let’s examine some models that fail MCMC diagnostics. Let’s run
the same model as the previous example, but set the iter
parameter to equal 50 (brms
defaults this parameter to
2000).
mcmc2 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
seed = 123,
iter = 50)
plot(mcmc2)
Here, we see our trace plots look vastly different than the ones in
the iter = 2000
example. With plots like this, we can infer
that the generated draws have high serial correlations and that the
chains have not yet converged around any value. This means that the
effective size of our sample is too small; we can fix this issue by
increasing the number of iterations (in this case, changing the
iter
parameter when fitting our model).
For this next model, we will use the same structure as the previous
two models, except we will set the iter
parameter to 400
and the warmup
to 10 (the warmup
parameter
defaults to iter/2
).
mcmc3 <- brm(formula = BMXBMI ~ AGE_C,
data = nhanes1518_sample,
seed = 123,
iter = 400,
warmup = 10)
plot(mcmc3)
Here, we see that the first portions of the trace plots look very
different than the rest; this means our chains are eventually converging
to the target distribution (as we get to the ‘fuzzy caterpillar’
portion), but our initial distributions are largely different than the
target. We can fix this by increasing the warmup
parameter,
which discards a number of the initial observations.
Trace plots allow us to examine the extent to which serial correlation exists among the draws from our model. We can look at this serial correlation more specifically by plotting the autocorrelation function (ACF) of our model.
Let’s plot the autocorrelation for mcmc1
, the model that
we just showed passes MCMC diagnostics. We can use the
mcmc_acf
function of the bayesplot
package to
do so, using the pars
argument to specify which ACFs we
want to see. Recall that an assumption for Bayesian linear regression is
the presence of no autocorrelation; thus, we want the ACFs to quickly
approach zero.
mcmc_acf(mcmc1, pars = c("b_Intercept", "b_AGE_C", "sigma"))
The ACFs for this model all quickly approach zero, meaning this is a valid model in terms of the no autocorrelation assumption.
Let’s compare these plots to those for the mcmc3
model,
which we just showed fails MCMC diagnostics.
mcmc_acf(mcmc3, pars = c("b_Intercept", "b_AGE_C", "sigma"))
These plots take much longer to reach zero, indicating a high degree of serial correlation. We would likely want to refit this model.
Autocorrelation is closely related to another indicator used in model evaluation: the effective sample size. This metric, often notated as \(n_{eff}\), estimates the total number of independent draws made from sampling the posterior distribution. Autocorrelation, of course, means that some sequential draws will not be independent. Thus, \(n_{eff}\) is usually smaller than the overall sample size \(N\).
We can examine effective sample size using the
neff_ratio
and mcmc_neff
functions from the
bayesplot
package. neff_ratio
calculates the
ratio \(\frac{n_{eff}}{N}\) for each
parameter specified in the pars
argument.
mcmc_neff
then takes a vector of these ratios and plots
them. Let’s continue to use mcmc1
, our diagnostic-passing
model, and plot its effective sample sizes. We add
theme_minimal
to make the plot easier to read.
neff_mcmc1 <- neff_ratio(mcmc1, pars = c("b_Intercept", "b_AGE_C", "sigma"))
mcmc_neff(neff_mcmc1) + theme_minimal()
This plot shows that the effective sample sizes for the parameters of this model are all close to the overall sample sizes. We want to see ratios closer to 1 (indicated by the lightest blue lines), as these indicate the lowest levels of autocorrelation.
Let’s compare these results to those for mcmc3
, one of
our diagnostic-failing models.
neff_mcmc3 <- neff_ratio(mcmc3, pars = c("b_Intercept", "b_AGE_C", "sigma"))
mcmc_neff(neff_mcmc3) + theme_minimal()
As we can see, the effective sample size ratios are much lower for this model, as indicated by darker blue lines. This means there are very little amounts of independent draws from the posterior distributions, indicating high levels of autocorrelation. As concluded earlier, we would want to refit this model.
In the section on single linear regression, we mentioned a list of assumptions for Bayesian linear regression that help ensure the validity of our model. In frequentists’ framework, we usually check the assumptions for a linear model using residual plots (Check module on frequentist linear regression for more details). We can do the similar check in Bayesian statistics.
In addition, Bayesian can construct posterior check, generating posterior samples and compare them with original data to evaluate the model performance.
In Bayesian estimation, we can obtain posterior samples \(\theta = (\theta^{(1)},\dots,\theta^{(N)})\). From which we can generate posterior predictions of response variables \(Y^* = (Y^*{(1)},\dots,Y^*{(N)})\), to check whether they are close to the original data. In addition, we can construct any statistics based on \(\theta\) or \(Y^*\), and compare these posterior samples to the feature of original data. If the data-based statistics lie at the tail of distribution of the posterior samples, we may conclude the model does not fit the data well at least cannot capture the desired feature well.
The function pp_check
can generate posterior samples
\(Y^*\) and plot it with the original
data. We can observe the posterior samples deviates from original data,
which does not seems a good fit.
pp_check(model1,nreps = 50)
We know the reason in previous tutorial module on frequentist linear regression that BMI has long tail, and thus the error term does not follow normal distribution. Here we fit the regression model use log(BMI) and construct posterior predictive check. It shows the model fits the data well.
model1log <- brm(formula = log(BMXBMI) ~ AGE_C,
data = nhanes1518,
seed = 123)
pp_check
helps to compare the distribution of orginal
data and posterior samples. Suppose we are interested in evaluating the
performance of our model when predicting the BMI of age 18. We can
generate responses given age equals to 18 and check how the true data
locate in the posterior distribution. posterior_predict
can
help us generated multiple sets of posteriors samples given the new
data. ppc_intervals
can help us visualize the credible
intervals and compare them with original data.
To ease computation, we draw 1000 samples from the data to evaluate the performance. For each data point, we generate 4000 posterior samples for different ages. We can observe the posterior interval fail to capture a few data points. There are some other important factors need to be introduced to the model.
set.seed(1)
nhanes1518_sample1000=na.omit(nhanes1518%>%select(BMXBMI,AGE_C))
nhanes1518_sample1000 <- sample_n(nhanes1518_sample1000, 1000)
set.seed(1)
predictions <- posterior_predict(model1, newdata = nhanes1518_sample1000)
dim(predictions)
## [1] 4000 1000
ppc_intervals(nhanes1518_sample1000$BMXBMI,
yrep = predictions, x = nhanes1518_sample1000$AGE_C,
prob = 0.5, prob_outer = 0.95)
Suppose we have multiple Bayesian linear regression models, and we desire to know which one is “better”. To accomplish this, we can do model comparison to determine which of our models minimizes prediction error.
One metric we can use to compare Bayesian models is the Widely Applicable Information Criterion (WAIC), which is an extension of the AIC metric. This statistic is a measurement of prediction error, and therefore we want to choose the model with the smallest WAIC.
Let’s compare the three models we just used to evaluate MCMC
diagnostics. We can visualize their WAIC values, which are calculated
using the waic()
method in the brms
package.
waic1 <- waic(mcmc1)$waic
waic2 <- waic(mcmc2)$waic
waic3 <- waic(mcmc3)$waic
wx <- c(waic1, waic2, waic3)
barplot(wx, ylab = "WAIC", names.arg=paste("mcmc", 1:3))
We see that the WAIC for mcmc1 is the lowest. This makes intuitive
sense, as this was the model that passed MCMC diagnostics. If we were
comparing these three models, we would proceed with
mcmc1
.
We can also use WAIC to determine what are the best priors to select
for our data. Let’s take a look at the first five models we fitted in
this module, using the nhanes1518_sample
dataset. Model 2
to 5 represent the model with prior \(N(0.0.09)\), \(N(0, 10000)\), \(StudentT(3, 0, 1)\), \(StudentT(30, 0, 1)\) respectively.
waic_default <- waic(default_prior)$waic
waic_mod2 <- waic(model2)$waic
waic_mod3 <- waic(model3)$waic
waic_mod4 <- waic(model4)$waic
waic_mod5 <- waic(model5)$waic
wy <- c(waic_default, waic_mod2, waic_mod3, waic_mod4, waic_mod5)
barplot(wy, ylab = "WAIC", ylim = c(194, 196),
names.arg = c("default", paste("model", 2:5)))
Although all of the models have very similar WAIC scores, model 4 (which uses a \(StudentT(3, 0, 1)\) prior for \(\beta_{Age}\)) slightly edges out the others. We would proceed with model 4. This makes sense as, since our sample size is very small, it is more fitting to use a student-t distribution than a normal distribution for our prior on \(\beta\).
Here is the model comparison between regression with one covariate and two covaraites. Multivariate linear regression has lower waic, suggesting including waist improves model performance.
waic(model1)$waic
## [1] 75490.06
waic(multi_model1)$waic
## [1] 51338.61