Building, Diagnosing, and Enhancing Regression Models
Learn how to perform linear and generalized linear modeling in R using lm() and glm(). This expanded tutorial covers model fitting, diagnostics, interpretation, and advanced techniques such as interaction terms and polynomial regression.
Statistical modeling is a cornerstone of data analysis in R. This tutorial will walk you through how to build regression models using lm() for linear regression and glm() for generalized linear models, such as logistic regression. In addition to model fitting, we will cover techniques for diagnosing model performance, interpreting results, and exploring advanced modeling techniques such as interaction terms and polynomial regression.
1. Linear Modeling with lm()
The lm() function fits a linear model by minimizing the sum of squared errors. We’ll start with a simple example using the mtcars dataset to predict miles-per-gallon (mpg) based on car weight (wt).
data("mtcars")
lm_model <- lm(mpg ~ wt, data = mtcars)
summary(lm_model)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Interpretation
Intercept and Coefficient:
The output shows the intercept (37.29) and the slope for wt (-5.34). Since wt is measured in units of 1000 lbs, each additional 1000 lbs of weight is associated with a decrease of about 5.34 mpg.
R-squared:
Indicates the proportion of variability in mpg explained by the model — here about 75%.
P-values:
Assess the statistical significance of the predictors.
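To make these numbers concrete, here is a short sketch that refits the model above, extracts the slope, and uses the model for prediction. The 3000-lb car is a hypothetical input chosen for illustration:

```r
data("mtcars")
lm_model <- lm(mpg ~ wt, data = mtcars)

# Slope for wt: about -5.34 mpg per additional 1000 lbs
coef(lm_model)["wt"]

# Predicted mpg for a hypothetical 3000-lb car (wt is in 1000-lb units)
predict(lm_model, newdata = data.frame(wt = 3))

# R-squared, extracted programmatically from the summary object
summary(lm_model)$r.squared
```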
2. Generalized Linear Modeling with glm()
The glm() function extends linear models to accommodate non-normal error distributions. A common use is logistic regression for binary outcomes. Here, we predict transmission type (am, where 0 = automatic, 1 = manual) based on car weight (wt).
glm_model <- glm(am ~ wt, data = mtcars, family = binomial)
summary(glm_model)
Call:
glm(formula = am ~ wt, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.040 4.510 2.670 0.00759 **
wt -4.024 1.436 -2.801 0.00509 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 19.176 on 30 degrees of freedom
AIC: 23.176
Number of Fisher Scoring iterations: 6
Interpretation
Coefficients:
Represent log-odds. Use exp(coef(glm_model)) to obtain odds ratios.
Family and Link Function:
The binomial family with a logit link is used for binary outcomes.
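As a sketch of the exponentiation step, the snippet below converts the log-odds coefficients to odds ratios; the confidence intervals come from confint(), which for glm objects profiles the likelihood (via the recommended MASS package):

```r
glm_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Odds ratios: each additional 1000 lbs multiplies the odds of a
# manual transmission by roughly exp(-4.02), i.e. cuts them sharply
exp(coef(glm_model))

# Profile-likelihood confidence intervals, on the odds-ratio scale
exp(confint(glm_model))
```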
3. Model Diagnostics and Validation
After fitting a model, it’s essential to verify that the model assumptions hold.
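A minimal sketch of the standard checks: R's built-in plot() method for lm objects produces four diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage), and a Shapiro-Wilk test gives a numeric check on residual normality:

```r
lm_model <- lm(mpg ~ wt, data = mtcars)

# Four standard diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))

# Numeric check: Shapiro-Wilk test of residual normality
# (a small p-value would suggest non-normal residuals)
shapiro.test(residuals(lm_model))
```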
4. Interpreting Model Results
Understanding your model’s output is critical for drawing meaningful conclusions.
Interpreting lm() Results
Coefficients:
Determine the relationship between predictors and the outcome.
Confidence Intervals:
Use confint(lm_model) to assess uncertainty in estimates.
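For example, applied to the model fitted in Section 1, confint() returns 95% intervals for both coefficients; the interval for wt comes out to roughly (-6.5, -4.2) mpg per 1000 lbs:

```r
lm_model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for the intercept and slope
confint(lm_model)

# A narrower or wider level can be requested explicitly
confint(lm_model, level = 0.90)
```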
Visualizing Predictions
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Linear Model: MPG vs. Weight", x = "Weight", y = "MPG")
Interpreting glm() Results
Odds Ratios:
Calculate using exp(coef(glm_model)) to understand the effect on the odds scale; an odds ratio below 1 means the predictor decreases the odds of the outcome.
Predicted Probabilities:
Use predict(glm_model, type = "response") to get probabilities for the binary outcome.
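A short sketch putting both ideas together: compute predicted probabilities for each car, then classify with the conventional 0.5 cutoff to gauge in-sample accuracy (the cutoff choice is an illustration, not a recommendation):

```r
glm_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for each car
probs <- predict(glm_model, type = "response")
head(probs)

# Classify at a 0.5 cutoff and compare with the observed labels
pred_class <- ifelse(probs > 0.5, 1, 0)
mean(pred_class == mtcars$am)  # in-sample accuracy
```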
5. Advanced Modeling Techniques
Enhance your models by incorporating additional predictors or transforming variables.
Interaction Terms
Include interactions to explore whether the effect of one predictor depends on another.
interaction_model <- lm(mpg ~ wt * cyl, data = mtcars)
summary(interaction_model)
Call:
lm(formula = mpg ~ wt * cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2288 -1.3495 -0.5042 1.4647 5.2344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.3068 6.1275 8.863 1.29e-09 ***
wt -8.6556 2.3201 -3.731 0.000861 ***
cyl -3.8032 1.0050 -3.784 0.000747 ***
wt:cyl 0.8084 0.3273 2.470 0.019882 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.368 on 28 degrees of freedom
Multiple R-squared: 0.8606, Adjusted R-squared: 0.8457
F-statistic: 57.62 on 3 and 28 DF, p-value: 4.231e-12
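With an interaction, the slope of wt is no longer a single number: it equals the wt coefficient plus the wt:cyl coefficient times cyl. The sketch below uses a hypothetical helper, slope_at(), to evaluate that expression at different cylinder counts:

```r
interaction_model <- lm(mpg ~ wt * cyl, data = mtcars)
b <- coef(interaction_model)

# slope_at() is a hypothetical helper for illustration:
# effect of wt at a given cyl = beta_wt + beta_{wt:cyl} * cyl
slope_at <- function(cyl) unname(b["wt"] + b["wt:cyl"] * cyl)

slope_at(4)  # steeper mpg loss per 1000 lbs among 4-cylinder cars
slope_at(8)  # shallower loss among 8-cylinder cars
```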
Polynomial Regression
Model non-linear relationships by adding polynomial terms.
poly_model <- lm(mpg ~ poly(wt, 2), data = mtcars)
summary(poly_model)
Call:
lm(formula = mpg ~ poly(wt, 2), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.483 -1.998 -0.773 1.462 6.238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.0906 0.4686 42.877 < 2e-16 ***
poly(wt, 2)1 -29.1157 2.6506 -10.985 7.52e-12 ***
poly(wt, 2)2 8.6358 2.6506 3.258 0.00286 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.651 on 29 degrees of freedom
Multiple R-squared: 0.8191, Adjusted R-squared: 0.8066
F-statistic: 65.64 on 2 and 29 DF, p-value: 1.715e-11
Model Comparison
Use criteria such as the AIC (Akaike Information Criterion) to compare models; lower AIC indicates a better trade-off between fit and complexity.
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + cyl, data = mtcars)
AIC(model1, model2)
df AIC
model1 3 166.0294
model2 4 156.0101
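Here model2 has the lower AIC and is preferred. Because model1 is nested in model2 (it uses a subset of the predictors), an F test via anova() gives a complementary formal comparison:

```r
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + cyl, data = mtcars)

# F test for nested linear models: does adding cyl significantly
# improve the fit beyond wt alone?
anova(model1, model2)
```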
Conclusion
This tutorial provided a comprehensive overview of statistical modeling in R using lm() and glm(). You learned how to fit models, diagnose issues, interpret outputs, and enhance your models with advanced techniques. By integrating these methods, you can build robust statistical models that yield insightful and reliable results.
@online{kassambara2024,
author = {Kassambara, Alboukadel},
title = {Statistical Modeling with lm() and glm() in R},
date = {2024-02-10},
url = {https://www.datanovia.com/learn/programming/r/data-science/statistical-modeling-with-lm-and-glm.html},
langid = {en}
}