Use linear regression to describe the relationship between a response variable and several predictors.
Estimate the coefficients of the regression line using the least squares method.
Interpret the coefficients of the regression line.
Make inferences and predictions.
Diagnose and evaluate the fitted model.
The library() function is used to access functionality that is provided by R packages but is not included in base R. install.packages() can be used to install new packages. Run this command from the console:
# install.packages("tidyverse")
First, load the tidyverse package, which will be used throughout the tutorial for data visualization.
library(tidyverse)
This tutorial uses the nhanes dataset; the variables are described in the file nhanes-codebook.txt. Load the data with the load() function, specifying the .rda data file.
load(file='nhanes1518.rda')
The functions head() and names() can be used to explore the data: head() outputs the first several rows of the data, and names() lists the names of all variables.
head(nhanes1518)
## # A tibble: 6 × 52
## SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int>
## 1 83732 134671. 135630. 2 1 62 4.39 3 5
## 2 83733 24329. 25282. 2 1 53 1.32 3 3
## 3 83734 12400. 12576. 2 1 78 1.51 3 3
## 4 83735 102718. 102079. 2 2 56 5 3 5
## 5 83736 17628. 18235. 2 2 42 1.23 4 4
## 6 83737 11252. 10879. 2 2 72 2.82 1 2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## # BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## # URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## # URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## # URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## # URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## # URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
names(nhanes1518)
## [1] "SEQN" "WTINT2YR" "WTMEC2YR" "RIDSTATR" "RIAGENDR" "RIDAGEYR"
## [7] "INDFMPIR" "RIDRETH1" "DMDEDUC2" "DMDEDUC3" "INDHHIN2" "BMXBMI"
## [13] "BMXWAIST" "BMXWT" "BMXHT" "URXUCR" "WTSB2YR" "URXCNP"
## [19] "URDCNPLC" "URXCOP" "URDCOPLC" "URXECP" "URDECPLC" "URXHIBP"
## [25] "URDHIBLC" "URXMBP" "URDMBPLC" "URXMC1" "URDMC1LC" "URXMCOH"
## [31] "URDMCOLC" "URXMEP" "URDMEPLC" "URXMHBP" "URDMHBLC" "URXMHH"
## [37] "URDMHHLC" "URXMHNC" "URDMCHLC" "URXMHP" "URDMHPLC" "URXMIB"
## [43] "URDMIBLC" "URXMNP" "URDMNPLC" "URXMOH" "URDMOHLC" "URXMZP"
## [49] "URDMZPLC" "WTINT4YR" "WTMEC4YR" "WTSB4YR"
As an example of linear regression, we will investigate the relationship between age and BMI. In the dataset, age and BMI are denoted RIDAGEYR and BMXBMI, respectively. We are interested in the relationship between age and BMI for adults in particular, so to prepare for the analysis we apply the filter() function to retain individuals whose age is greater than or equal to 18.
nhanes1518 <- nhanes1518 %>% filter(RIDAGEYR >= 18)
We present the distribution of age using the geom_histogram() function. binwidth controls the width of the bins in the histogram; color and fill control the colors of the outline and the fill, respectively.
# Basic histogram
ggplot(nhanes1518, aes(x=RIDAGEYR)) +
geom_histogram(binwidth=2, color="black", fill="white")+
labs(x = "Age", title = "Distribution of Age") # axis label and plot title
We see that the distribution of age is asymmetric, with peaks near the youngest adult ages and around age 80.
We can also add a vertical line indicating the mean age, and overlay the histogram with a transparent density plot. The value of alpha controls the level of transparency.
# Histogram with density plot
ggplot(nhanes1518, aes(x=RIDAGEYR)) +
geom_histogram(binwidth=2, aes(y=..density..), colour="black", fill="white")+
labs(x = "Age", title = "Density of Age") +
geom_vline(aes(xintercept=mean(RIDAGEYR)),
color="blue", linetype="dashed", size=1) + # Add mean line
geom_density(alpha=.2, fill="#FF6666") # add a layer of density
Similarly, we can explore the distribution of BMI, which we found to be right-skewed:
# Basic histogram
ggplot(nhanes1518, aes(x=BMXBMI)) +
geom_histogram(binwidth=2, color="black", fill="white") +
labs(x = "BMI", title = "Distribution of BMI")
We can also use a scatterplot (geom_point()) to explore the relationship between age and BMI. The size and pch arguments adjust the size and the shape of the points, respectively.
ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) +
geom_point(size=1, color="dark blue", pch = 20, alpha=0.2) +
labs(title="BMI vs. Age for the nhanes data",
y="Body Mass Index (kg/m**2)",
x="Age (year)")
We now introduce linear regression, which models the relationship between a response variable and one or more predictor variables by fitting a linear equation to observed data. The goal is to find the best-fitting line (or hyperplane) that minimizes the sum of the squared differences between the observed responses and the values predicted by the equation. The resulting equation can then be used to make predictions about the response for new inputs. In essence, linear regression answers the question of how changes in the independent variables relate to changes in the dependent variable.
A simple linear regression takes the form
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
where \(Y_i\) is the dependent variable (also called the response variable), \(X_i\) is the independent variable (also called the predictor or explanatory variable), and \(\epsilon_i\) is the random error term.
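As a small illustration of the least squares idea (a sketch, assuming the filtered nhanes1518 data and the tidyverse are already loaded; dat, b0, and b1 are illustrative names), the slope and intercept for a single predictor can be computed in closed form and compared with the output of lm():
# Closed-form least squares estimates for a single predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
dat <- nhanes1518 %>% filter(!is.na(BMXBMI), !is.na(RIDAGEYR))
b1 <- cov(dat$RIDAGEYR, dat$BMXBMI) / var(dat$RIDAGEYR)  # slope estimate
b0 <- mean(dat$BMXBMI) - b1 * mean(dat$RIDAGEYR)         # intercept estimate
c(intercept = b0, slope = b1)  # should match coef(lm(BMXBMI ~ RIDAGEYR, data = dat))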
The assumptions of the linear regression model are as follows: (1) linearity, the relationship between the predictors and the mean of the response is linear; (2) independence, the errors are independent of one another; (3) homoscedasticity, the errors have constant variance; and (4) normality, the errors are normally distributed.
We will verify that our data meet these assumptions in the section Model Diagnostics and Evaluations.
We’ll start by fitting a simple linear model using the lm() function. In the lm() formula, the variable on the left of the ~ symbol is the response variable and the variable(s) to the right are the predictor(s). Here we use BMI as the response and age as the predictor.
lm.fit <- lm(BMXBMI ~ RIDAGEYR, data = nhanes1518)
There are several ways to examine the model results. The summary() function gives an extensive overview of the model fit:
summary(lm.fit)
##
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR, data = nhanes1518)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.889 -5.076 -1.189 3.708 56.912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.664351 0.194587 147.309 < 2e-16 ***
## RIDAGEYR 0.017809 0.003727 4.778 1.79e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.261 on 11094 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 0.002053, Adjusted R-squared: 0.001963
## F-statistic: 22.83 on 1 and 11094 DF, p-value: 1.795e-06
\(\beta_0\): Not interpretable in our case. For a person of age 0 (which falls outside the domain of our model, since we only consider adults), the expected BMI is 28.664. Using the model at a predictor value outside the range of the data is referred to as extrapolation.
\(\beta_1\): For every unit increase (a year) in the age of a person, his or her BMI is expected to increase by 0.018 on average.
p value: The p value tells us how likely it is that data as extreme as what we observed would occur under the null hypothesis (more on the null hypothesis in subsequent tutorials), i.e., that there is no association between the predictor age and the response BMI. From the model above, the p value for age is 1.79e-06, so age is statistically significant.
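As a quick sketch, the same quantities can be pulled out of the fitted model programmatically instead of being read off the printed summary:
# summary(lm.fit)$coefficients is a matrix with columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)"
coef_table <- summary(lm.fit)$coefficients
coef_table["RIDAGEYR", "Estimate"]   # slope estimate for age
coef_table["RIDAGEYR", "Pr(>|t|)"]   # p value for the age coefficient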
Extrapolation and Data Centering
In the previous example, the intercept is interpreted as the average BMI of an individual at age 0, which does not make sense: in reality, it is meaningless to report the BMI of an individual at age 0. This issue is called extrapolation. Extrapolation occurs when we use a model to make inferences or predictions for predictor values outside the observed range (age = 0 in this example).
To make the intercept interpretable, an easy way is to center age. Centering a variable in R involves subtracting a reference value (often the mean) from every observation of that variable. This can be useful in various statistical analyses, such as regression analysis, where centering can improve the interpretation of the results. In this case, we center age at 18 (the youngest age in our data) to simplify the interpretation:
nhanes1518$RIDAGEYR <- nhanes1518$RIDAGEYR-18
lm.fit <- lm(BMXBMI ~ RIDAGEYR, data = nhanes1518)
summary(lm.fit)
##
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR, data = nhanes1518)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.889 -5.076 -1.189 3.708 56.912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.984906 0.133970 216.354 < 2e-16 ***
## RIDAGEYR 0.017809 0.003727 4.778 1.79e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.261 on 11094 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 0.002053, Adjusted R-squared: 0.001963
## F-statistic: 22.83 on 1 and 11094 DF, p-value: 1.795e-06
The coefficients of the linear regression model can be extracted using the coef() function, and the confidence interval(s) with the confint() function.
coef(lm.fit)
## (Intercept) RIDAGEYR
## 28.98490649 0.01780863
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 28.72230148 29.24751150
## RIDAGEYR 0.01050238 0.02511488
We can visualize the regression line by setting the argument method = "lm" within the geom_smooth() function.
ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) +
geom_smooth(method = "lm", formula = y ~ x, colour = "red") + # specify the regression formula
geom_point(size=1, color="dark blue", pch = 20, alpha=0.2) +
labs(title="BMI vs. Age for the nhanes data", y="Body Mass Index (kg/m**2)",x="Age (year)")
We can use the predict() function to obtain prediction intervals or confidence intervals for a given value of the predictor variable, RIDAGEYR. Note that when using the predict() function, the column names and format of the new points at which to predict need to match the original data frame used to fit the lm() model. If you encounter errors using the predict() function, this is a good first thing to check.
Notice that we centered age at 18 in the previous model fit. Therefore, we also need to subtract 18 from the ages at which we make predictions.
predict(lm.fit, data.frame(RIDAGEYR = (c(18, 30, 60)-18)), interval = "confidence")
## fit lwr upr
## 1 28.98491 28.72230 29.24751
## 2 29.19861 29.00583 29.39139
## 3 29.73287 29.57498 29.89076
predict(lm.fit, data.frame(RIDAGEYR = (c(18, 30, 60)-18)), interval = "prediction")
## fit lwr upr
## 1 28.98491 14.74986 43.21995
## 2 29.19861 14.96468 43.43254
## 3 29.73287 15.49937 43.96637
Prediction intervals and confidence intervals are both statistical tools used to quantify uncertainty in an outcome or parameter. However, they have different meanings and interpretations.
Confidence interval: a range of values that is likely to contain the true value of a parameter with a certain degree of confidence. For example, if you are estimating the mean BMI of a population based on their age, a 95% confidence interval means that if we were to repeat the sampling process many times and compute the confidence interval each time, we would expect the true population parameter to lie within the interval in 95% of the cases. The width of the confidence interval reflects the uncertainty in the estimation, with wider intervals indicating more uncertainty.
Prediction: An estimate of a specific value or outcome based on the statistical model. For example, in our case when we use a linear regression model to predict the BMI of a person based on their age, a prediction would be the estimated BMI for a person of a specific age. Unlike a confidence interval which provides a range of values for the parameter of a specific population, a prediction interval provides a range of values for the actual observation or outcome.
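A small sketch to make the distinction concrete: at the same age, the prediction interval is much wider than the confidence interval, because it also accounts for the variability of individual observations around the regression line (age is passed as 30 - 18 because the model uses centered age; new_pt, conf_int, and pred_int are illustrative names):
new_pt <- data.frame(RIDAGEYR = 30 - 18)
conf_int <- predict(lm.fit, new_pt, interval = "confidence")
pred_int <- predict(lm.fit, new_pt, interval = "prediction")
conf_int[, "upr"] - conf_int[, "lwr"]  # width of the confidence interval
pred_int[, "upr"] - pred_int[, "lwr"]  # width of the (much wider) prediction interval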
log.lm.fit <- lm(log(BMXBMI) ~ RIDAGEYR, data = nhanes1518)
log.lm.fit
##
## Call:
## lm(formula = log(BMXBMI) ~ RIDAGEYR, data = nhanes1518)
##
## Coefficients:
## (Intercept) RIDAGEYR
## 3.3310057 0.0008693
If the diagnostic plots suggest a possible violation of the normality assumption, we can log-transform the response variable BMI. Log-transforming the response is useful in several scenarios:
Normalization: Logarithmic transformations can make the data more symmetric or approximately normal. Many statistical methods, including linear regression, assume that the errors follow a normal distribution. Taking the logarithm can turn right-skewed (positively skewed) data into a more symmetric form, leading to more accurate parameter estimates and better model performance.
Homoscedasticity: Log transformations can stabilize the variance of the data. When the variance of the response is not constant across levels of the predictors (heteroscedasticity), regression inference suffers. Log-transforming the response can help equalize variances, resulting in more consistent residuals and a better model fit.
Outlier handling: Log transformations can reduce the impact of extreme outliers, which can disproportionately influence model estimates and predictions. By compressing the range of extreme values, the log transformation makes the fit less sensitive to outliers.
Interpretation of the model coefficients: for each one-year increase in age, we expect BMI to change by a factor of \(e^{0.0008693}\), i.e., about 1.00087.
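A short sketch of how to use the log-scale model (assuming log.lm.fit from above, with age centered at 18): predictions are on the log scale, so exp() back-transforms them to BMI units; under the model assumptions this gives the predicted median BMI rather than the mean.
# Predicted BMI on the original scale at ages 30 and 60; subtract 18 because age is centered
exp(predict(log.lm.fit, data.frame(RIDAGEYR = c(30, 60) - 18)))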
We conduct model diagnosis mainly based on residuals. The plot() function provides a convenient way to create four diagnostic plots, and the par() function arranges the four plots in 2 rows and 2 columns. We will show later how each plot can also be created manually; for now, we use the plot() function to demonstrate its usage. (You can also use the gglm package to draw nicer versions of these plots.)
par(mfrow = c(2, 2))
plot(lm.fit)
The diagnostic plots show residuals in four different ways:
Residuals vs Fitted. Used to check the linearity assumption. A roughly horizontal trend with no distinct pattern indicates a linear relationship, which is what we want. The model we fitted shows roughly a linear relationship, with no distinct patterns (such as a fan or funnel shape) in the residuals vs. fitted plot.
Normal Q-Q. Used to examine whether the residuals are normally distributed. Ideally, the residual points follow the straight dashed line. Here the Q-Q plot generally follows the line, with some deviation at the upper end towards high theoretical quantiles.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.
Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might substantially change the regression results when included or excluded from the analysis. Points outside the boundary indicated by the red dashed line (Cook's distance) are influential. Based on the residuals vs. leverage plot, there are no influential points according to Cook's distance, although a few observations with large standardized residuals (for example, observations 8723, 10070, and 11734) are flagged as potential outliers; see the sketch below for inspecting these values numerically.
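As a sketch of how to inspect these diagnostics numerically, the Cook's distances can be extracted directly from the fitted model and sorted to find the most influential observations:
# Cook's distance for each observation; large values flag influential points
cd <- cooks.distance(lm.fit)
head(sort(cd, decreasing = TRUE))  # the most influential observations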
We can also reconstruct the four plots above ourselves. For example, we can use the residuals() and rstudent() functions to extract the residuals and studentized residuals, respectively, from the linear model and plot them against the predicted values.
plot(predict(lm.fit), residuals(lm.fit))
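Similarly, a brief sketch using rstudent() to plot the studentized residuals, where points beyond roughly ±3 are commonly flagged as potential outliers:
# Studentized residuals against fitted values
plot(predict(lm.fit), rstudent(lm.fit))
abline(h = c(-3, 3), lty = 2)  # common rule-of-thumb cutoffs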
Some metrics and hypothesis tests for model evaluation:
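For example, a few of these quantities (R-squared, adjusted R-squared, and the overall F statistic) can be extracted from the summary object; this is only a sketch, not an exhaustive list:
s <- summary(lm.fit)
s$r.squared      # proportion of variance in BMI explained by the model
s$adj.r.squared  # R-squared adjusted for the number of predictors
s$fstatistic     # F statistic for the overall test that all slopes are zero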
Multiple linear regression allows us to evaluate the relationship between two variables while controlling for (i.e., removing) the potential effect of other variables. The lm() function can also fit multiple linear regression models. In this section, we explore the relationship between the response variable BMXBMI and the predictors RIDAGEYR and BMXWAIST.
lm.fit <- lm(BMXBMI ~ RIDAGEYR + BMXWAIST, data = nhanes1518)
summary(lm.fit)
##
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR + BMXWAIST, data = nhanes1518)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9158 -1.7720 -0.2756 1.4522 26.6711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.832041 0.160690 -48.74 <2e-16 ***
## RIDAGEYR -0.060291 0.001503 -40.11 <2e-16 ***
## BMXWAIST 0.391700 0.001619 241.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.766 on 10531 degrees of freedom
## (1314 observations deleted due to missingness)
## Multiple R-squared: 0.8478, Adjusted R-squared: 0.8477
## F-statistic: 2.932e+04 on 2 and 10531 DF, p-value: < 2.2e-16
The coefficient of RIDAGEYR is -0.06, which means that for every unit increase in the participant's age, the BMI is expected to decrease by 0.06 on average, holding all else constant (i.e., holding the other predictor, BMXWAIST, constant). Notice that the effect of age here differs from the one in the simple linear regression, since we now control for the potential confounder BMXWAIST. An alternative way to specify the model is to create a data frame with the selected variables and use a dot to include all variables in that data frame as predictors.
# we can use select to filter the variables of interest
nhanes_core<-nhanes1518 %>% select(BMXBMI, RIDAGEYR, BMXWAIST)
# In the lm() formula, a dot . can be used to include all variables in the NHANES data as predictors.
lm.fit1 <- lm(BMXBMI ~ ., data = nhanes_core)
# If we want to exclude specific variables from the list of predictors, we can use the `-` notation.
# Including `-1` excludes the intercept from the model.
lm.fit1 <- lm(BMXBMI ~ .- 1, data = nhanes_core)
# Exclude BMXWAIST from the model
lm.fit1 <- lm(BMXBMI ~ .- BMXWAIST, data = nhanes_core)
When including multiple predictors into a regression model, some predictors may be highly correlated with each other (known as multicollinearity). Multicollinearity among the predictors impedes model inference and prediction. The standard errors for our regression coefficients will inflate, and thus we will lose precision in our estimates of the regression coefficients. Therefore, apart from diagnostics for simple linear regression, we also need to perform multicollinearity checks for multiple linear regression:
library(car)
# Calculate VIFs
vifs <- vif(lm.fit)
# Print VIFs
print(vifs)
## RIDAGEYR BMXWAIST
## 1.047592 1.047592
Variance Inflation Factor (VIF): The VIF measures the increase in the variance of an estimated coefficient due to multicollinearity. A VIF value greater than 5 indicates strong multicollinearity in the model. Here multicollinearity is not a concern, since both VIF values are well below 5.
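As a complementary (and optional) check, a small sketch: one can also look at the pairwise correlation between the two predictors, and a low correlation is consistent with the small VIF values.
# Pairwise correlation between the predictors, using complete cases only
cor(nhanes1518$RIDAGEYR, nhanes1518$BMXWAIST, use = "complete.obs")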
Apart from continuous predictors such as age, we may also have categorical predictors. For example, suppose we explore the effect of income on the response BMI. Income is stored in the dataset as a categorical predictor with codes 1, 2, ..., 15, 77, 99. Specifically, the values 77 and 99 represent "refused to answer" and "don't know the answer" in the survey. The encoding of the income categories can be found on the CDC website.
In a regression model, a categorical variable with k levels can be represented by k-1 dummy variables (also called indicator variables), with each dummy variable representing one of the levels of the categorical variable, and the omitted reference level represented by a constant term in the regression equation.
For example, if we want to model bmi ~ income, our model can be written as
\[Y_i = \beta_0 +\beta_1 I(Income_1)_i + ... + \beta_{k-1} I(Income_{k-1})_i +\epsilon_i\]
An indicator variable \(I(Income_j)_i\) takes the value 1 if observation \(i\) belongs to income category \(j\), and 0 otherwise.
By omitting one level of the categorical variable (called the reference level), we avoid the situation in which the dummy variables sum to one for every observation, which would be perfectly collinear with the intercept. This also allows us to interpret the coefficient of each dummy variable as the difference between the corresponding level and the omitted reference level, holding all other variables in the model constant.
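A minimal, self-contained sketch (with a hypothetical 3-level factor, not the NHANES income variable) showing the dummy coding that R builds from a factor via model.matrix():
# A 3-level factor expands to an intercept plus 2 indicator columns;
# the first level ("low") is the omitted reference level
toy <- data.frame(group = factor(c("low", "mid", "high", "mid"),
                                 levels = c("low", "mid", "high")))
model.matrix(~ group, data = toy)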
In R, we only need to convert the categorical variable to a factor using the as.factor() function in the data frame. Once income is included as a factor in the lm() model, the lm() function performs the dummy coding automatically. By default, the first level of the income factor is used as the baseline.
nhanes_income <- nhanes1518 %>%
select(BMXBMI, RIDAGEYR,INDHHIN2) %>% # select variables of interest
rename(income=INDHHIN2) %>% # rename the variable as `income`
filter(!income %in%c("77","99")) %>% # first drop categories with values 77 (Refused) and 99 (Don't Know)
mutate(income=as.factor(income)) # specify the income as factor
head(nhanes_income) # fct indicates that income is stored as a factor
## # A tibble: 6 × 3
## BMXBMI RIDAGEYR income
## <dbl> <dbl> <fct>
## 1 27.8 44 10
## 2 30.8 35 4
## 3 28.8 60 5
## 4 42.4 38 10
## 5 20.3 24 7
## 6 28.6 54 14
Then we fit the linear regression on the categorical variable income, controlling for the potential influence of age:
lm_category<-lm(BMXBMI ~ income + RIDAGEYR, data = nhanes_income)
lm_category%>%summary()
##
## Call:
## lm(formula = BMXBMI ~ income + RIDAGEYR, data = nhanes_income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.503 -5.084 -1.205 3.734 55.902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.68145 0.43944 65.268 < 2e-16 ***
## income2 1.02975 0.57329 1.796 0.07249 .
## income3 0.45855 0.52584 0.872 0.38321
## income4 0.25289 0.51315 0.493 0.62215
## income5 0.49441 0.51235 0.965 0.33458
## income6 0.79872 0.47927 1.667 0.09564 .
## income7 0.79401 0.48240 1.646 0.09981 .
## income8 0.90406 0.49801 1.815 0.06950 .
## income9 1.34468 0.51271 2.623 0.00874 **
## income10 0.20230 0.53114 0.381 0.70330
## income12 -0.03653 0.56393 -0.065 0.94835
## income13 1.03508 0.72032 1.437 0.15076
## income14 0.32001 0.48557 0.659 0.50989
## income15 -0.66837 0.46010 -1.453 0.14635
## RIDAGEYR 0.01599 0.00394 4.059 4.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.25 on 10175 degrees of freedom
## (1167 observations deleted due to missingness)
## Multiple R-squared: 0.008594, Adjusted R-squared: 0.00723
## F-statistic: 6.3 on 14 and 10175 DF, p-value: 9.565e-13
Baseline: income category 1 corresponding to a household income of 0 to 4,999 dollars.
Model Interpretation:
income6: The coefficient for the predictor income6 is 0.799, which means that for participants with household income in category 6 (25,000 to 34,999 dollars per year), the BMI is expected to be 0.799 higher than that of participants with household income in category 1 (0 to 4,999 dollars), on average, holding age constant.
In the regression in the previous example, lm(BMXBMI ~ income + RIDAGEYR), we assume the impact of age on BMI is constant across all income groups, which does not appear to be true.
The following plot shows the relationship between BMI and age for people with two levels of income, level 1 (annual income of 0-4999 USD) and level 10 (annual income of 65,000 to 74,999 USD), where each level is represented by a different color.
The plot provides a visual representation of how the effect of age on BMI differs by income level: for income category 1 (annual income of 0-4,999 USD), we observe a steeper slope, which suggests that for each additional year of age we expect a larger increase in BMI than for an individual in income category 10. Conversely, for income category 10 (annual income of 65,000 to 74,999 USD), we observe a flatter slope, which suggests a smaller expected increase in BMI per additional year of age compared with income category 1.
# Plot the interaction effects
df<-nhanes_income%>%
filter(income %in%c("1", "10"))
ggplot(df, aes(x = RIDAGEYR, y = BMXBMI, color = income)) +
geom_point(alpha=0.5) +
stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)+
labs(x = "age", y = "BMI") +
ggtitle("Interaction effects between income and age on BMI")
Based on the exploratory data analysis, we see that age and income interact in their effects on BMI. Therefore, we include interaction terms between these two variables in our regression model.
There are two ways to include interaction terms in the model: : and *. The : symbol includes only the interaction term between the two variables, while the * symbol includes the variables themselves as well as the interaction terms. This means that income*RIDAGEYR is equivalent to income + RIDAGEYR + income:RIDAGEYR.
As before, the categorical variable is encoded with k-1 dummy variables; notice that category 1 does not appear in the model output, so it is the baseline.
lm(BMXBMI ~ income + RIDAGEYR + income:RIDAGEYR, data = nhanes_income)%>%
summary()
##
## Call:
## lm(formula = BMXBMI ~ income + RIDAGEYR + income:RIDAGEYR, data = nhanes_income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.410 -5.065 -1.181 3.735 55.767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.23992 0.71587 38.052 < 2e-16 ***
## income2 2.58318 1.06127 2.434 0.014948 *
## income3 2.76725 0.99618 2.778 0.005482 **
## income4 1.95920 0.90809 2.157 0.030991 *
## income5 2.07901 0.90658 2.293 0.021855 *
## income6 2.70895 0.82884 3.268 0.001085 **
## income7 2.48674 0.82807 3.003 0.002679 **
## income8 2.82906 0.85905 3.293 0.000994 ***
## income9 3.11086 0.90280 3.446 0.000572 ***
## income10 2.26515 0.93762 2.416 0.015716 *
## income12 0.43623 1.01673 0.429 0.667895
## income13 1.72769 1.29744 1.332 0.183018
## income14 1.28959 0.83801 1.539 0.123866
## income15 0.13180 0.79385 0.166 0.868138
## RIDAGEYR 0.07287 0.02265 3.217 0.001298 **
## income2:RIDAGEYR -0.06018 0.03036 -1.982 0.047480 *
## income3:RIDAGEYR -0.07978 0.02801 -2.849 0.004400 **
## income4:RIDAGEYR -0.06468 0.02674 -2.418 0.015614 *
## income5:RIDAGEYR -0.06121 0.02692 -2.274 0.022998 *
## income6:RIDAGEYR -0.07132 0.02521 -2.830 0.004669 **
## income7:RIDAGEYR -0.06528 0.02553 -2.557 0.010575 *
## income8:RIDAGEYR -0.07346 0.02650 -2.772 0.005582 **
## income9:RIDAGEYR -0.06801 0.02785 -2.442 0.014616 *
## income10:RIDAGEYR -0.07849 0.02894 -2.713 0.006688 **
## income12:RIDAGEYR -0.02711 0.02964 -0.914 0.360546
## income13:RIDAGEYR -0.03428 0.03570 -0.960 0.337016
## income14:RIDAGEYR -0.03982 0.02631 -1.514 0.130177
## income15:RIDAGEYR -0.03494 0.02485 -1.406 0.159722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.246 on 10162 degrees of freedom
## (1167 observations deleted due to missingness)
## Multiple R-squared: 0.01089, Adjusted R-squared: 0.008258
## F-statistic: 4.142 on 27 and 10162 DF, p-value: 3.26e-12
# Alternative representation: * includes the main effects and the interaction.
lm_int1 <- lm(BMXBMI ~ income*RIDAGEYR, data = nhanes_income)
# A simple way to include all pairwise interaction terms is the syntax `.^2`
lm_int2 <- lm(BMXBMI ~ .^2, data = nhanes_income)
Model Interpretation
Coefficient of income10: The BMI of an individual of age 18 in income category 10 is expected to be 2.265 higher than the BMI of an individual of age 18 in income category 1 (the baseline category).
Coefficient of income10:RIDAGEYR: With a one-unit increase in age, the increase in BMI of an individual in income category 10 is expected to be 0.0785 lower than the increase in BMI of an individual in income category 1.
income10 + intercept: The BMI of an individual of age 18 in income category 10 is expected to be 29.505 (= 27.240 + 2.265). This corresponds to the intercept of the blue line in the previous plot.
RIDAGEYR + income10:RIDAGEYR: With a one-year increase in age, the BMI of an individual in income category 10 is expected to decrease by 0.00562 (= 0.07287 - 0.07849). This corresponds to the slope of the blue line in the previous plot.
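As a closing sketch, these combined quantities can be computed directly from the fitted interaction model (using lm_int1 defined above) rather than by hand:
# Reconstruct the intercept and slope of the income-10 line from the coefficients
b <- coef(lm_int1)
b["(Intercept)"] + b["income10"]        # intercept of the income-10 line (about 29.505)
b["RIDAGEYR"] + b["income10:RIDAGEYR"]  # slope of the income-10 line (about -0.0056)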