Topics

  • Use linear regression to describe the relationship between a response variable and one or more predictors.

  • Estimate the coefficients of the regression line using the least squares method.

  • Interpret the coefficients of the regression line.

  • Perform inference and prediction.

  • Conduct model diagnostics.

Library, Dataset, and Motivation

The library() function is used to access functionality that is provided by R packages but is not included in base R.

install.packages() can be used to install new packages.

Run this command from the console.

# install.packages("tidyverse")

First, load the package tidyverse that will be used throughout the tutorial for data visualizations.

library(tidyverse)

This tutorial uses the nhanes dataset; its variables are described in the file nhanes-codebook.txt. Load the data with the load() function, specifying the .rda data file.

load(file='nhanes1518.rda')

The functions head() and names() can be used to explore the data.

head() outputs the first several rows of the data.

names() returns the names of all variables.

head(nhanes1518)
## # A tibble: 6 × 52
##    SEQN WTINT2YR WTMEC2YR RIDSTATR RIAGENDR RIDAGEYR INDFMPIR RIDRETH1 DMDEDUC2
##   <int>    <dbl>    <dbl>    <int>    <int>    <int>    <dbl>    <int>    <int>
## 1 83732  134671.  135630.        2        1       62     4.39        3        5
## 2 83733   24329.   25282.        2        1       53     1.32        3        3
## 3 83734   12400.   12576.        2        1       78     1.51        3        3
## 4 83735  102718.  102079.        2        2       56     5           3        5
## 5 83736   17628.   18235.        2        2       42     1.23        4        4
## 6 83737   11252.   10879.        2        2       72     2.82        1        2
## # ℹ 43 more variables: DMDEDUC3 <int>, INDHHIN2 <int>, BMXBMI <dbl>,
## #   BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## #   URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## #   URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## #   URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## #   URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## #   URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
names(nhanes1518)
##  [1] "SEQN"     "WTINT2YR" "WTMEC2YR" "RIDSTATR" "RIAGENDR" "RIDAGEYR"
##  [7] "INDFMPIR" "RIDRETH1" "DMDEDUC2" "DMDEDUC3" "INDHHIN2" "BMXBMI"  
## [13] "BMXWAIST" "BMXWT"    "BMXHT"    "URXUCR"   "WTSB2YR"  "URXCNP"  
## [19] "URDCNPLC" "URXCOP"   "URDCOPLC" "URXECP"   "URDECPLC" "URXHIBP" 
## [25] "URDHIBLC" "URXMBP"   "URDMBPLC" "URXMC1"   "URDMC1LC" "URXMCOH" 
## [31] "URDMCOLC" "URXMEP"   "URDMEPLC" "URXMHBP"  "URDMHBLC" "URXMHH"  
## [37] "URDMHHLC" "URXMHNC"  "URDMCHLC" "URXMHP"   "URDMHPLC" "URXMIB"  
## [43] "URDMIBLC" "URXMNP"   "URDMNPLC" "URXMOH"   "URDMOHLC" "URXMZP"  
## [49] "URDMZPLC" "WTINT4YR" "WTMEC4YR" "WTSB4YR"

We will investigate the relationship between age and BMI as a running example of linear regression. In the dataset, age and BMI are recorded as RIDAGEYR and BMXBMI, respectively.

Exploratory Data Analysis

We are interested in the relationship between age and BMI, especially for adults. To prepare for the analysis, we apply the filter() function to retain individuals aged 18 or older.

nhanes1518 <- nhanes1518 %>% filter(RIDAGEYR >= 18)

We present the distribution of age using the geom_histogram() function. binwidth controls the width of the bins in the histogram; color and fill control the colors of the bar outlines and the bar fill, respectively.

# Basic histogram
ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
  geom_histogram(binwidth=2, color="black", fill="white")+ 
  labs(x = "Age", title = "Distribution of Age") # legend of the plot

We see that the distribution of age is asymmetric, with peaks at around age 80 and near the lower end of the adult age range.

We can also add a vertical line indicating the mean age, and overlay the histogram with a transparent density plot. The value of alpha controls the level of transparency.

# Histogram with density plot
ggplot(nhanes1518, aes(x=RIDAGEYR)) + 
  geom_histogram(binwidth=2, aes(y=..density..), colour="black", fill="white") +
  labs(x = "Age", title = "Density of Age") +
  geom_vline(aes(xintercept=mean(RIDAGEYR)),
             color="blue", linetype="dashed", size=1) + # add mean line
  geom_density(alpha=.2, fill="#FF6666") # add a layer of density

Similarly, we can explore the distribution of BMI, which we found to be right-skewed:

# Basic histogram
ggplot(nhanes1518, aes(x=BMXBMI)) + 
  geom_histogram(binwidth=2, color="black", fill="white") + 
  labs(x = "BMI", title = "Distribution of BMI") 

We can also utilize the scatterplot (geom_point) to explore the relationship between age and BMI:

size and pch adjust the size and shape of the points, respectively.

ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) + 
    geom_point(size=1, color="dark blue", pch = 20, alpha=0.2) + 
  labs(title="BMI vs. Age for the nhanes data", 
       y="Body Mass Index (kg/m**2)", 
       x="Age (year)")

Linear Regression

We introduce linear regression which aims to model the relationship between a response variable and one or more predictor variables by fitting a linear equation to observed data. The goal of linear regression is to find the best-fitting line (or hyperplane) that minimizes the sum of the squared differences between the observed responses and the values predicted by the equation. The resulting equation can then be used to make predictions about the response for new inputs. In essence, linear regression aims to answer the question of how changes in the independent variables relate to changes in the dependent variable.

Simple Linear Regression

A simple linear regression takes the form

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

where \(Y_i\) is the dependent variable (also called the response variable), \(X_i\) is the independent variable (also called the predictor or explanatory variable), and \(\epsilon_i\) is the random error term.

  • \(\beta_1\): True slope of the relationship between X and Y
  • \(\beta_0\): True intercept of the relationship between X and Y
  • \(\epsilon\): Random error term
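
For a single predictor, the least squares estimates have a simple closed form: \(\hat{\beta}_1 = \widehat{\mathrm{Cov}}(X, Y) / \widehat{\mathrm{Var}}(X)\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). As a minimal sketch (using the filtered nhanes1518 data from above and keeping only complete age/BMI pairs, since BMI has missing values), we can compute these by hand and compare them with the lm() fit in the Model Fitting subsection below; the object names here are purely illustrative.

x  <- nhanes1518$RIDAGEYR
y  <- nhanes1518$BMXBMI
ok <- complete.cases(x, y)                       # keep only complete (age, BMI) pairs
beta1_hat <- cov(x[ok], y[ok]) / var(x[ok])
beta0_hat <- mean(y[ok]) - beta1_hat * mean(x[ok])
c(intercept = beta0_hat, slope = beta1_hat)      # should match coef(lm.fit) below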

Model Assumptions

The assumptions of linear regression model are as follows:

  • Linearity: The relationship between the independent and dependent variables is linear.
  • Independence: The observations are independent of each other, meaning that the value of the dependent variable for one observation is not influenced by the values of the independent variables for other observations.
  • Homoscedasticity: The variance of the errors is constant for all values of the independent variables.
  • Normality: The errors are normally distributed.

We will verify that our data meet all the assumptions above in the section Model Diagnostics and Evaluations.

Model Fitting

We’ll start by fitting a simple linear model using the lm() function. In the lm() formula, the variable to the left of the ~ symbol is the response and the variable(s) to the right of the ~ are the predictor(s). Here we use BMI as the response and age as the predictor.

lm.fit <- lm(BMXBMI ~ RIDAGEYR, data = nhanes1518)

There are several ways that we can examine the model results. The summary() function gives a more extensive overview of the model fit:

summary(lm.fit)
## 
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR, data = nhanes1518)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.889  -5.076  -1.189   3.708  56.912 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.664351   0.194587 147.309  < 2e-16 ***
## RIDAGEYR     0.017809   0.003727   4.778 1.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.261 on 11094 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:  0.002053,   Adjusted R-squared:  0.001963 
## F-statistic: 22.83 on 1 and 11094 DF,  p-value: 1.795e-06

Model Interpretation

  • \(\beta_0\): Not directly interpretable in our case. For a person aged 0 (which falls outside the domain of our model, since we only consider adults), the expected BMI is 28.664. Using the model at such a value is called extrapolation, since the value of the predictor falls outside the range of our data.

  • \(\beta_1\): For every unit increase (a year) in the age of a person, his or her BMI is expected to increase by 0.018 on average.

  • p value: The p value tells us how likely data as extreme as what we observed would be under the null hypothesis (more material on the null hypothesis in subsequent tutorials), i.e. that there is no association between the predictor variable age and the response BMI. From the model above, we have a p value of 1.79e-06 for age, which tells us that the predictor variable age is statistically significant.
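
If you prefer to extract these quantities programmatically rather than read them off the printout, the coefficient table returned by summary() can be indexed directly. A small sketch using the lm.fit object above (coef_table is just an illustrative name):

coef_table <- summary(lm.fit)$coefficients   # matrix of estimates, SEs, t values, p values
coef_table["RIDAGEYR", "Estimate"]           # estimated slope for age
coef_table["RIDAGEYR", "Pr(>|t|)"]           # p value for the age coefficient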

Extrapolation and Data Centering

In the previous example, the intercept is interpreted as the average BMI of an individual at age 0, which does not make sense, since in reality it is meaningless to report the BMI of an individual at age 0. This issue is called extrapolation. Extrapolation occurs when we use a model to make inference or predictions at predictor values outside the range of the observed data (age = 0 in this example).

To make the intercept interpretable, an easy way is to center age. In R, centering a variable means subtracting a reference value (often the mean) from every observation of that variable. This can be useful in various statistical analyses, such as regression analysis, where centering can improve the interpretability of the results. In this case, we center age at 18 to simplify the interpretation:

nhanes1518$RIDAGEYR <- nhanes1518$RIDAGEYR-18
lm.fit <- lm(BMXBMI ~ RIDAGEYR, data = nhanes1518)
summary(lm.fit)
## 
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR, data = nhanes1518)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.889  -5.076  -1.189   3.708  56.912 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.984906   0.133970 216.354  < 2e-16 ***
## RIDAGEYR     0.017809   0.003727   4.778 1.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.261 on 11094 degrees of freedom
##   (752 observations deleted due to missingness)
## Multiple R-squared:  0.002053,   Adjusted R-squared:  0.001963 
## F-statistic: 22.83 on 1 and 11094 DF,  p-value: 1.795e-06
  • Interpreting the intercept: For a person aged 18, the expected BMI is 28.985. In the rest of the tutorial, we will use age centered at 18.

The coefficients of the linear regression model can be extracted using the coef() function and the confidence interval(s) with the confint() function.

coef(lm.fit)
## (Intercept)    RIDAGEYR 
## 28.98490649  0.01780863
confint(lm.fit)
##                   2.5 %      97.5 %
## (Intercept) 28.72230148 29.24751150
## RIDAGEYR     0.01050238  0.02511488

We can visualize the regression line by setting the argument method="lm" within the geom_smooth() function.

ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) + 
    geom_smooth(method = "lm", formula = y ~ x, colour = "red") + # specify the regression formula
    geom_point(size=1, color="dark blue", pch = 20, alpha=0.2) + 
  labs(title="BMI vs. Age for the nhanes data", y="Body Mass Index (kg/m**2)",x="Age (year)")

Model Prediction

We can use the predict() function to obtain prediction intervals or confidence intervals for a given value of the predictor variable, RIDAGEYR. Note that when using the predict() function, the column names and format of the new points at which to predict need to be the same as in the original data frame used to fit the lm() model. If you encounter errors using the predict() function, this is a good first thing to check.

Notice that we centered age at 18 when fitting the model. Therefore, we also need to center age at 18 when we use the model to make predictions.

predict(lm.fit, data.frame(RIDAGEYR = (c(18, 30, 60)-18)), interval = "confidence")
##        fit      lwr      upr
## 1 28.98491 28.72230 29.24751
## 2 29.19861 29.00583 29.39139
## 3 29.73287 29.57498 29.89076
predict(lm.fit, data.frame(RIDAGEYR = (c(18, 30, 60)-18)), interval = "prediction")
##        fit      lwr      upr
## 1 28.98491 14.74986 43.21995
## 2 29.19861 14.96468 43.43254
## 3 29.73287 15.49937 43.96637

Prediction Interval vs Confidence Interval

Prediction intervals and confidence intervals are both used to estimate or quantify uncertainty in a particular outcome or parameter. However, they have different meanings and interpretations.

  • Confidence interval: a range of values that is likely to contain the true value of a parameter with a certain degree of confidence. For example, if you are estimating the mean BMI of a population based on their age, a 95% confidence interval means that if we were to repeat the sampling process many times and compute the confidence interval each time, we would expect the true population parameter to lie within the interval in 95% of the cases. The width of the confidence interval reflects the uncertainty in the estimation, with wider intervals indicating more uncertainty.

  • Prediction interval: a range of values for a specific observation or outcome based on the statistical model. For example, when we use a linear regression model to predict the BMI of a person based on their age, the point prediction is the estimated BMI for a person of that specific age. Unlike a confidence interval, which provides a range of values for a population parameter (here, the mean BMI at a given age), a prediction interval provides a range of values for the actual observation or outcome, and is therefore wider.
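
To see the difference concretely, the short sketch below (reusing lm.fit from above with an arbitrary, hypothetical age of 40, centered at 18) compares the widths of the two intervals at the same point; the prediction interval is much wider because it must cover an individual observation rather than the mean.

new_point <- data.frame(RIDAGEYR = 40 - 18)                    # hypothetical age of 40, centered at 18
conf_int  <- predict(lm.fit, new_point, interval = "confidence")
pred_int  <- predict(lm.fit, new_point, interval = "prediction")
conf_int[, "upr"] - conf_int[, "lwr"]   # width of the confidence interval for the mean BMI
pred_int[, "upr"] - pred_int[, "lwr"]   # width of the prediction interval for an individual's BMI (wider)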

Log Transformation

log.lm.fit <- lm(log(BMXBMI) ~ RIDAGEYR, data = nhanes1518)
log.lm.fit
## 
## Call:
## lm(formula = log(BMXBMI) ~ RIDAGEYR, data = nhanes1518)
## 
## Coefficients:
## (Intercept)     RIDAGEYR  
##   3.3310057    0.0008693

As we will see a possible violation of the normality assumption in the diagnostic checks below, we can log transform the response variable BMI. Log transforming the response is useful in several scenarios:

  • Normalization: Logarithmic transformations can help make the data more symmetric or approximately normal. Many statistical methods, including linear regression, assume that the errors follow a normal distribution. By taking the logarithm, you may transform data that is right-skewed (positively skewed) into a more symmetric form, which can lead to more accurate parameter estimates and better model performance.

  • Homoscedasticity: Log transformations can stabilize the variance of the data. When the variance of the response variable is not constant across different levels of the predictors (heteroscedasticity), regression analysis can suffer. Log transforming the response can help equalize the variances, resulting in more consistent residuals and a better model fit.

  • Outlier Handling: Log transformations can reduce the impact of extreme outliers, which can disproportionately influence model estimates and predictions. By compressing the range of extreme values, log transformations make the data less sensitive to outliers.

Interpretation of the model coefficients: for each one-year increase in age, we expect BMI to increase by a factor of \(e^{0.0008693}\), i.e. about 1.00087 (roughly a 0.09% increase).
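
A convenient way to obtain these multiplicative effects (and their confidence intervals) is to exponentiate the log-scale estimates; a short sketch using log.lm.fit from above:

exp(coef(log.lm.fit))     # multiplicative effect on BMI per one-unit change in each term
exp(confint(log.lm.fit))  # confidence intervals on the multiplicative scale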

Model Diagnostics and Evaluations

We conduct model diagnostics mainly based on residuals. The plot() function provides a convenient way to create four diagnostic plots. We use the par() function to arrange the four plots in 2 rows and 2 columns. We will show later that each plot can also be created manually. For now, we use the plot() function below to demonstrate its usage:

(You can also use gglm package to draw nicer plots.)

par(mfrow = c(2, 2))
plot(lm.fit)

The diagnostic plots show residuals in four different ways:

  • Residuals vs Fitted. Used to check the linearity assumption. A roughly horizontal line without distinct patterns is an indication of a linear relationship, which is good. The model we fitted shows roughly a linear relationship, with no distinct patterns (such as a fan or funnel shape) in the residuals vs. fitted plot.

  • Normal Q-Q. Used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line. Here the Q-Q plot generally follows the straight dashed line, with some deviations at the upper end, towards high theoretical quantiles.

  • Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.

  • Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might change the regression results when included in or excluded from the analysis. Points falling outside the boundary indicated by the red dashed line (Cook’s distance) are influential. Based on the residuals vs. leverage plot, there are no influential points according to Cook’s distance. However, there are some points with large standardized residuals that could be flagged as outliers; for example, observations 8723, 10070, and 11734 are marked in the plot.

We can also reconstruct the four plots above ourselves. For example, we can use the residuals() and rstudent() functions to extract the residuals and studentized residuals, respectively, from the linear model and plot them against the fitted values.

plot(predict(lm.fit), residuals(lm.fit))
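
In the same spirit, the Normal Q-Q and Scale-Location panels can be approximated with base R functions; the short sketch below uses rstandard() to obtain standardized residuals (std_res is an illustrative name).

std_res <- rstandard(lm.fit)              # internally standardized residuals
qqnorm(std_res)                           # Normal Q-Q plot
qqline(std_res)
plot(predict(lm.fit), sqrt(abs(std_res)), # Scale-Location plot
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")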

Some metrics and hypothesis tests for model evaluations:

  • \(R^2\): From the model above, we have an adjusted R-squared value of about 0.002, which indicates that only about 0.2% of the variability in the response variable BMI is explained by the predictor variable age.
  • p value: The p value tells us how likely data as extreme as what we observed would be under the null hypothesis (more material on the null hypothesis in subsequent tutorials), i.e. that there is no association between the predictor variable age and the response BMI. From the model above, we have a p value of 1.795e-06, which tells us that the predictor variable age is statistically significant.
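
These overall summaries can also be pulled from the summary object instead of being read off the printout; a small sketch using lm.fit from above (fit_summary and f are illustrative names):

fit_summary <- summary(lm.fit)
fit_summary$adj.r.squared                                    # adjusted R-squared
f <- fit_summary$fstatistic                                  # F statistic with its degrees of freedom
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)   # p value of the overall F test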

Multivariate Linear Regression

Multiple linear regression allows us to evaluate the relationship between two variables while controlling for (i.e., removing) the potential effect of other variables. The lm() function can also fit multiple linear regression models. In this section, we explore the relationship between the response variable BMXBMI and the predictors RIDAGEYR and BMXWAIST.

lm.fit <- lm(BMXBMI ~ RIDAGEYR + BMXWAIST, data = nhanes1518)
summary(lm.fit)
## 
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR + BMXWAIST, data = nhanes1518)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9158  -1.7720  -0.2756   1.4522  26.6711 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.832041   0.160690  -48.74   <2e-16 ***
## RIDAGEYR    -0.060291   0.001503  -40.11   <2e-16 ***
## BMXWAIST     0.391700   0.001619  241.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.766 on 10531 degrees of freedom
##   (1314 observations deleted due to missingness)
## Multiple R-squared:  0.8478, Adjusted R-squared:  0.8477 
## F-statistic: 2.932e+04 on 2 and 10531 DF,  p-value: < 2.2e-16

Model Interpretation

  • Intercept: For a person aged 18 with a waist circumference of 0, the expected BMI is -7.832. (Notice that you may also want to center BMXWAIST at a pre-specified level to make the intercept interpretable.)
  • \(\beta_{age}\): The coefficient for the predictor RIDAGEYR is -0.06, which means that for every one-year increase in a participant’s age, BMI is expected to decrease by 0.06 on average, holding the other predictor, BMXWAIST, constant. Notice that the effect of age here differs from the one in the simple linear regression, since we now control for BMXWAIST.
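
As with the simple model, predict() works the same way here, provided the new data frame contains every predictor used in the fit. Below is a small sketch with hypothetical values (age 40, entered as 40 - 18 because of the centering, and a waist circumference of 95 cm):

new_obs <- data.frame(RIDAGEYR = 40 - 18, BMXWAIST = 95)  # hypothetical individual
predict(lm.fit, new_obs, interval = "confidence")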

An alternative way to specify the model is to create a data frame with the selected variables and use a dot to include all variables in that data frame as predictors.

# We can use select() to keep the variables of interest
nhanes_core <- nhanes1518 %>% select(BMXBMI, RIDAGEYR, BMXWAIST)
# In the lm() formula, a dot . includes all other variables in nhanes_core as predictors.
lm.fit1 <- lm(BMXBMI ~ ., data = nhanes_core)
# To exclude specific terms from the predictors, we can use the `-` notation.
# Including `-1` excludes the intercept from the model.
lm.fit1 <- lm(BMXBMI ~ . - 1, data = nhanes_core)
# Exclude BMXWAIST from the predictors
lm.fit1 <- lm(BMXBMI ~ . - BMXWAIST, data = nhanes_core)

Multicollinearity Diagnostics

When including multiple predictors into a regression model, some predictors may be highly correlated with each other (known as multicollinearity). Multicollinearity among the predictors impedes model inference and prediction. The standard errors for our regression coefficients will inflate, and thus we will lose precision in our estimates of the regression coefficients. Therefore, apart from diagnostics for simple linear regression, we also need to perform multicollinearity checks for multiple linear regression:

library(car)
# Calculate VIFs
vifs <- vif(lm.fit)

# Print VIFs
print(vifs)
## RIDAGEYR BMXWAIST 
## 1.047592 1.047592

Variance Inflation Factor (VIF): The VIF measures how much the variance of an estimated coefficient is inflated due to multicollinearity. A VIF value greater than 5 is commonly taken to indicate strong multicollinearity in the model. Here, multicollinearity is not a concern since both VIF values are well below 5.
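
As a sanity check, a VIF can also be computed from first principles as \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing the predictor of interest on the remaining predictors. A sketch for RIDAGEYR, using the nhanes_core data frame defined above (r2_age is an illustrative name):

r2_age <- summary(lm(RIDAGEYR ~ BMXWAIST, data = nhanes_core))$r.squared
1 / (1 - r2_age)   # should agree with vif(lm.fit) above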

Categorical Variable

Apart from continuous predictors such as age, we may also have categorical variables as predictors.

For example, suppose we explore the effect of income on the response BMI. In the dataset, income is stored as a categorical predictor variable with codes 1, 2, ..., 15, 77, 99. Specifically, the values 77 and 99 represent refused to answer and don't know the answer in the survey. The encoding of the income categories can be found on the CDC website.

In a regression model, a categorical variable with k levels can be represented by k-1 dummy variables (also called indicator variables), with each dummy variable representing one of the levels of the categorical variable, and the omitted reference level represented by a constant term in the regression equation.

For example, if we regress BMI on income (bmi ~ income), the model is written as

\[Y_i = \beta_0 +\beta_1 I(Income_1)_i + ... + \beta_{k-1} I(Income_{k-1})_i +\epsilon_i\]

An indicator variable \(I(Income_j)\) takes the value 1 or 0:

  • 1 if the observation belongs to category j
  • 0 if the observation does not belong to category j
  • all indicator variables are 0 if the observation belongs to the reference category

By omitting one level of the categorical variable (called the reference level), we avoid the situation where the dummy variables always sum to one and are therefore perfectly collinear with the intercept, which eliminates the problem of perfect multicollinearity. This also allows us to interpret the coefficient of each dummy variable as the difference between that level and the omitted reference level, holding all other variables in the model constant.

In R, we only need to encode categorical variables as factors using the as.factor() function in the data frame. Once income is included as a factor in the lm() model, the lm() function performs the dummy coding automatically. By default, the first factor level of the income category is used as the baseline.

nhanes_income <- nhanes1518 %>%
  select(BMXBMI, RIDAGEYR,INDHHIN2) %>% # select variables of interest
  rename(income=INDHHIN2) %>% # rename the variable as `income`
  filter(!income %in%c("77","99")) %>% # first drop categories with values 77 (Refused) and 99 (Don't Know)
  mutate(income=as.factor(income)) # specify the income as factor
head(nhanes_income) # <fct> indicates that income is stored as a factor
## # A tibble: 6 × 3
##   BMXBMI RIDAGEYR income
##    <dbl>    <dbl> <fct> 
## 1   27.8       44 10    
## 2   30.8       35 4     
## 3   28.8       60 5     
## 4   42.4       38 10    
## 5   20.3       24 7     
## 6   28.6       54 14
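
Before fitting, you can inspect the dummy (indicator) coding that R will generate from the factor by looking at the design matrix; a quick sketch with model.matrix() on the nhanes_income data:

# Each income level except the baseline (level 1) gets its own 0/1 indicator column
head(model.matrix(~ income, data = nhanes_income))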

Then we fit the linear regression on the categorical variable income, controlling for the potential influence of age:

lm_category <- lm(BMXBMI ~ income + RIDAGEYR, data = nhanes_income)
lm_category %>% summary()
## 
## Call:
## lm(formula = BMXBMI ~ income + RIDAGEYR, data = nhanes_income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.503  -5.084  -1.205   3.734  55.902 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.68145    0.43944  65.268  < 2e-16 ***
## income2      1.02975    0.57329   1.796  0.07249 .  
## income3      0.45855    0.52584   0.872  0.38321    
## income4      0.25289    0.51315   0.493  0.62215    
## income5      0.49441    0.51235   0.965  0.33458    
## income6      0.79872    0.47927   1.667  0.09564 .  
## income7      0.79401    0.48240   1.646  0.09981 .  
## income8      0.90406    0.49801   1.815  0.06950 .  
## income9      1.34468    0.51271   2.623  0.00874 ** 
## income10     0.20230    0.53114   0.381  0.70330    
## income12    -0.03653    0.56393  -0.065  0.94835    
## income13     1.03508    0.72032   1.437  0.15076    
## income14     0.32001    0.48557   0.659  0.50989    
## income15    -0.66837    0.46010  -1.453  0.14635    
## RIDAGEYR     0.01599    0.00394   4.059 4.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.25 on 10175 degrees of freedom
##   (1167 observations deleted due to missingness)
## Multiple R-squared:  0.008594,   Adjusted R-squared:  0.00723 
## F-statistic:   6.3 on 14 and 10175 DF,  p-value: 9.565e-13

Baseline: income category 1 corresponding to a household income of 0 to 4,999 dollars.

Model Interpretation:

  • Intercept: The intercept 28.681 means that for a person aged 18 in the baseline income category (category 1, corresponding to a household income of 0 to 4,999 dollars), the BMI is expected to be 28.681 on average.
  • income6: The coefficient for the predictor income6 is 0.799, which means that for participants in household income category 6 (25,000 to 34,999 dollars per year), the BMI is expected to be 0.799 higher than that of participants in income category 1 (0 to 4,999 dollars), on average, holding age constant.

Interaction Terms

In the regression in the previous example, lm(BMXBMI ~ income + RIDAGEYR), we assume the effect of age on BMI is constant across all income groups, which may not be true.

The following plot shows the relationship between BMI and age for people with two levels of income, level 1 (annual income of 0-4999 USD) and level 10 (annual income of 65,000 to 74,999 USD), where each level is represented by a different color.

The plot provides a visual representation of how the effect of age on BMI differs by income level: for income category 1 (annual income of 0-4,999 USD), we observe a steeper slope, which suggests that for each additional year of age we expect a larger increase in BMI than for an individual in income category 10. Conversely, for income category 10 (annual income of 65,000 to 74,999 USD), we observe a flatter slope, which suggests a smaller increase in BMI for each additional year of age.

# Plot the interaction effects
df<-nhanes_income%>%
  filter(income %in%c("1", "10"))
ggplot(df, aes(x = RIDAGEYR, y = BMXBMI, color = income)) + 
  geom_point(alpha=0.5) + 
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)+
  labs(x = "age", y = "BMI") + 
  ggtitle("Interaction effects between income and age on BMI")

Based on the exploratory data analysis, the effect of age on BMI appears to depend on income; that is, the two predictors interact. Therefore, we include interaction terms between these two variables in our regression model.

There are two ways to include interaction terms in the model, : and *. The : symbol only includes the interaction term between the two variables, while the * symbol includes the variables themselves, as well as the interaction terms. This means that income*RIDAGEYR is equivalent to income + RIDAGEYR + income:RIDAGEYR.

As before, the categorical variable is encoded with k-1 dummy variables; category 1 does not appear in the model output and is therefore the baseline.

lm(BMXBMI ~ income + RIDAGEYR + income:RIDAGEYR, data = nhanes_income)%>%
  summary()
## 
## Call:
## lm(formula = BMXBMI ~ income + RIDAGEYR + income:RIDAGEYR, data = nhanes_income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.410  -5.065  -1.181   3.735  55.767 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       27.23992    0.71587  38.052  < 2e-16 ***
## income2            2.58318    1.06127   2.434 0.014948 *  
## income3            2.76725    0.99618   2.778 0.005482 ** 
## income4            1.95920    0.90809   2.157 0.030991 *  
## income5            2.07901    0.90658   2.293 0.021855 *  
## income6            2.70895    0.82884   3.268 0.001085 ** 
## income7            2.48674    0.82807   3.003 0.002679 ** 
## income8            2.82906    0.85905   3.293 0.000994 ***
## income9            3.11086    0.90280   3.446 0.000572 ***
## income10           2.26515    0.93762   2.416 0.015716 *  
## income12           0.43623    1.01673   0.429 0.667895    
## income13           1.72769    1.29744   1.332 0.183018    
## income14           1.28959    0.83801   1.539 0.123866    
## income15           0.13180    0.79385   0.166 0.868138    
## RIDAGEYR           0.07287    0.02265   3.217 0.001298 ** 
## income2:RIDAGEYR  -0.06018    0.03036  -1.982 0.047480 *  
## income3:RIDAGEYR  -0.07978    0.02801  -2.849 0.004400 ** 
## income4:RIDAGEYR  -0.06468    0.02674  -2.418 0.015614 *  
## income5:RIDAGEYR  -0.06121    0.02692  -2.274 0.022998 *  
## income6:RIDAGEYR  -0.07132    0.02521  -2.830 0.004669 ** 
## income7:RIDAGEYR  -0.06528    0.02553  -2.557 0.010575 *  
## income8:RIDAGEYR  -0.07346    0.02650  -2.772 0.005582 ** 
## income9:RIDAGEYR  -0.06801    0.02785  -2.442 0.014616 *  
## income10:RIDAGEYR -0.07849    0.02894  -2.713 0.006688 ** 
## income12:RIDAGEYR -0.02711    0.02964  -0.914 0.360546    
## income13:RIDAGEYR -0.03428    0.03570  -0.960 0.337016    
## income14:RIDAGEYR -0.03982    0.02631  -1.514 0.130177    
## income15:RIDAGEYR -0.03494    0.02485  -1.406 0.159722    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.246 on 10162 degrees of freedom
##   (1167 observations deleted due to missingness)
## Multiple R-squared:  0.01089,    Adjusted R-squared:  0.008258 
## F-statistic: 4.142 on 27 and 10162 DF,  p-value: 3.26e-12
# Alternative representation
lm_int1 <- lm(BMXBMI ~ income*RIDAGEYR, data = nhanes_income)
# A simple way to include all pairwise interaction terms is the syntax `.^2`
lm_int2 <- lm(BMXBMI ~ .^2, data = nhanes_income)

Model Interpretation

  • Coefficient of income10: The BMI of an individual aged 18 in income category 10 is expected to be 2.265 higher than the BMI of an individual aged 18 in income category 1 (the baseline category).

  • Coefficient of income10:RIDAGEYR: For each one-year increase in age, the increase in BMI for an individual in income category 10 is expected to be 0.0785 smaller than the increase for an individual in income category 1.

  • income10 + intercept: The BMI of an individual aged 18 in income category 10 is expected to be 29.505 (= 27.240 + 2.265). This corresponds to the intercept of the income-category-10 line in the previous plot.

  • RIDAGEYR + income10:RIDAGEYR: With a one-year increase in age, the BMI of an individual in income category 10 is expected to decrease by 0.00562 (= 0.07287 - 0.07849). This corresponds to the slope of the income-category-10 line in the previous plot.
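
These combined quantities can be computed directly from the fitted coefficients; a short sketch using the lm_int1 object defined above (the coefficient names follow R's default factor coding, and b is an illustrative name):

b <- coef(lm_int1)
b["(Intercept)"] + b["income10"]           # intercept of the income-category-10 line (about 29.505)
b["RIDAGEYR"] + b["income10:RIDAGEYR"]     # slope of the income-category-10 line (about -0.00562)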