28  Multiple Linear Regression

Author
Affiliations

Noel Singh Dias and Jingwei Gu

Florida International University

Robert Stempel College of Public Health and Social Work

# install.packages("tidyverse")
# install.packages("ggplot2")
# install.packages("readr")
library(tidyverse)
library(ggplot2)
library(readr)

28.1 Multiple Linear Regression

Multiple Linear Regression (MLR) is a statistical technique used to understand the relationship between one dependent variable and two or more independent variables. The relationship is captured by modeling the observed data with a mathematical representation or approximation.

28.1.1 Key Components in MLR:

  1. Dependent Variable (Y): The outcome variable, i.e., the variable expected to change in response to the influencing factors.

  2. Independent Variables (X1, X2, …, Xn): Predictor or influencing variables used to predict the dependent variable.

  3. Regression Coefficients (β0, β1, …, βn): Parameters that quantify the relationship between each independent variable and the dependent variable, where β0 is the intercept and β1 to βn are the slopes for the respective independent variables.

  4. Error Term (ε): Represents the random variability in the dependent variable that is not explained by the independent variables.

28.1.2 Model Equation:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon \]

28.2 Assumptions:

  1. Linearity: The relationship between the dependent and independent variables is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: The variance of the residuals (errors) is constant across all levels of the independent variables.
  4. Normality: Residuals are normally distributed.
  5. No Multicollinearity: Independent variables are not highly correlated with each other.
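Several of these assumptions can be checked directly in R. The sketch below fits a small model to R's built-in mtcars data (a stand-in example, not the insurance data used later) and runs basic checks; the same calls apply to any `lm()` fit.

```r
# Fit a small model to built-in data; the same checks apply to any lm() fit
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Linearity and homoscedasticity: residuals vs. fitted values should show
# a random scatter around zero with roughly constant spread
# plot(fitted(fit), residuals(fit))

# Normality of residuals: Shapiro-Wilk test (also inspect a Q-Q plot)
shapiro.test(residuals(fit))

# Multicollinearity: variance inflation factors via the car package
# (install.packages("car") first); VIF above roughly 5-10 signals a problem
# car::vif(fit)
```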

28.3 Purpose:

To predict the value of the dependent variable based on the values of the independent variables and to understand the strength and type of relationships between the dependent variable and multiple independent variables.

28.4 Steps in Conducting MLR:

  1. Data Collection: Gather data for the dependent and independent variables.
  2. Model Specification: Define the model equation with the dependent variable and chosen independent variables.
  3. Estimation of Coefficients: Use statistical software to estimate the regression coefficients.
  4. Model Evaluation: Assess the model’s goodness-of-fit using R-squared, adjusted R-squared, and other metrics.
  5. Diagnostic Checking: Check the assumptions of MLR (linearity, independence, homoscedasticity, normality, no multicollinearity).

28.5 Model Evaluation Metrics:

  1. R-squared (R²): Measures the proportion of variance in the dependent variable explained by the independent variables.
  2. Adjusted R-squared: Adjusted version of R² that accounts for the number of predictors in the model.
  3. F-statistic: Tests the overall significance of the model.
  4. p-values: Test the significance of individual regression coefficients.
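All four metrics can be pulled directly from an `lm()` fit. The sketch below again uses the built-in mtcars data as a stand-in; any fitted model works the same way.

```r
# Fit a small model on built-in data and extract the evaluation metrics
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$r.squared                   # R-squared
s$adj.r.squared               # adjusted R-squared (always <= R-squared)
s$fstatistic                  # F-statistic with its degrees of freedom
s$coefficients[, "Pr(>|t|)"]  # p-values for individual coefficients

# Overall model p-value, computed from the F-statistic
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```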

28.6 Demonstration

We perform a multiple linear regression analysis on a dataset of medical insurance costs. This dataset includes variables such as age, sex, BMI, number of children, smoker status, and region. Our goal is to understand the relationship between these variables and insurance charges, and to predict charges from the other factors.

The insurance dataset can be found on Kaggle.

# Load the data
data <- read_csv("../data/04_insurance.csv")
Rows: 1338 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): sex, smoker, region
dbl (4): age, bmi, children, charges

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the data
head(data)
# A tibble: 6 × 7
    age sex      bmi children smoker region    charges
  <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
1    19 female  27.9        0 yes    southwest  16885.
2    18 male    33.8        1 no     southeast   1726.
3    28 male    33          3 no     southeast   4449.
4    33 male    22.7        0 no     northwest  21984.
5    32 male    28.9        0 no     northwest   3867.
6    31 female  25.7        0 no     southeast   3757.

28.6.1 Summary Statistics

We begin by examining the summary statistics of the dataset to understand its structure and the distribution of variables.

summary(data)
      age            sex                 bmi           children    
 Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
 1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
 Median :39.00   Mode  :character   Median :30.40   Median :1.000  
 Mean   :39.21                      Mean   :30.66   Mean   :1.095  
 3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
 Max.   :64.00                      Max.   :53.13   Max.   :5.000  
    smoker             region             charges     
 Length:1338        Length:1338        Min.   : 1122  
 Class :character   Class :character   1st Qu.: 4740  
 Mode  :character   Mode  :character   Median : 9382  
                                       Mean   :13270  
                                       3rd Qu.:16640  
                                       Max.   :63770  

We will first log-transform the charges variable. This is often done to stabilize variance and make the data more normally distributed, which can improve the performance and interpretation of regression models.

# Log-transform the charges variable
data$log_charges <- log(data$charges)

Before fitting the model, we need to ensure that our categorical variables are correctly encoded.

# Convert categorical variables to factors
data$sex <- as.factor(data$sex)
data$smoker <- as.factor(data$smoker)
data$region <- as.factor(data$region)

Before converting to factors:
sex: character values “female”, “male”.
smoker: character values “no”, “yes”.
region: character values “northeast”, “northwest”, “southeast”, “southwest”.

After converting to factors:
sex: factor with levels 1 = “female”, 2 = “male”.
smoker: factor with levels 1 = “no”, 2 = “yes”.
region: factor with levels 1 = “northeast”, 2 = “northwest”, 3 = “southeast”, 4 = “southwest”.
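By default, `as.factor()` orders levels alphabetically, and the first level becomes the reference category for the model's dummy variables. A minimal sketch of inspecting and changing that ordering (the toy vector here is illustrative, not part of the insurance data):

```r
# as.factor() orders levels alphabetically; the first level is the
# reference category that the regression dummies are measured against
x <- as.factor(c("no", "yes", "no"))
levels(x)  # "no" "yes"

# relevel() changes the reference category if a different baseline is wanted
x2 <- relevel(x, ref = "yes")
levels(x2)  # "yes" "no"
```

This is why the model output below contains `sexmale`, `smokeryes`, and three non-northeast region dummies: “female”, “no”, and “northeast” are the alphabetically first, and hence reference, levels.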

28.6.2 Model Fitting

We fit a multiple linear regression model to predict insurance charges based on the other variables in the dataset.

# Fit the multiple linear regression model
model <- lm(log_charges ~ age + sex + bmi + children + smoker + region,
  data = data)

# Summary of the model
summary(model)

Call:
lm(formula = log_charges ~ age + sex + bmi + children + smoker + 
    region, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07186 -0.19835 -0.04917  0.06598  2.16636 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.0305581  0.0723960  97.112  < 2e-16 ***
age              0.0345816  0.0008721  39.655  < 2e-16 ***
sexmale         -0.0754164  0.0244012  -3.091 0.002038 ** 
bmi              0.0133748  0.0020960   6.381 2.42e-10 ***
children         0.1018568  0.0100995  10.085  < 2e-16 ***
smokeryes        1.5543228  0.0302795  51.333  < 2e-16 ***
regionnorthwest -0.0637876  0.0349057  -1.827 0.067860 .  
regionsoutheast -0.1571967  0.0350828  -4.481 8.08e-06 ***
regionsouthwest -0.1289522  0.0350271  -3.681 0.000241 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4443 on 1329 degrees of freedom
Multiple R-squared:  0.7679,    Adjusted R-squared:  0.7666 
F-statistic: 549.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Coefficients and Interpretation:

Intercept: The intercept of the model is 7.0305581, which represents the expected log of charges when all numeric predictors are zero and the categorical predictors are at their reference levels (female, non-smoker, northeast). This value is highly significant with a p-value less than 2e-16.

Age: For each additional year of age, the log of charges increases by 0.0345816, holding the other predictors constant. This effect is highly significant with a p-value less than 2e-16.

Sex (male): Being male decreases the log of charges by 0.0754164 compared to being female. This effect is significant with a p-value of 0.002038.

BMI: Each unit increase in BMI results in an increase in the log of charges by 0.0133748. This effect is highly significant with a p-value of 2.42e-10.

Number of children: Each additional child increases the log of charges by 0.1018568. This effect is highly significant with a p-value less than 2e-16.

Smoking status (yes): Being a smoker increases the log of charges by 1.5543228. This effect is highly significant with a p-value less than 2e-16.

Region (northwest): Living in the northwest region decreases the log of charges by 0.0637876 compared to the baseline region (northeast). This effect is marginally significant with a p-value of 0.067860.

Region (southeast): Living in the southeast region decreases the log of charges by 0.1571967 compared to the baseline region. This effect is highly significant with a p-value of 8.08e-06.

Region (southwest): Living in the southwest region decreases the log of charges by 0.1289522 compared to the baseline region. This effect is significant with a p-value of 0.000241.

The model explains approximately 76.79% of the variance in the log of charges, as indicated by the multiple R-squared value of 0.7679 and the adjusted R-squared value of 0.7666.

The overall model is highly significant, as indicated by the F-statistic of 549.8 with a p-value less than 2.2e-16.
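Because the model predicts log(charges), the coefficients are additive on the log scale but multiplicative on the dollar scale: a one-unit increase in a predictor multiplies expected charges by exp(β). A minimal sketch of back-transforming coefficients and predictions, using the built-in mtcars data with a log-transformed outcome as a stand-in for the insurance model:

```r
# Log-outcome model on built-in data, as a stand-in for the insurance model
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Approximate percent change in the outcome per one-unit predictor change
100 * (exp(coef(fit)) - 1)

# Back-transform a prediction to the original scale of the outcome
new_obs <- data.frame(wt = 3.0, hp = 120)
exp(predict(fit, newdata = new_obs))
```

Applied to the insurance model, exp(1.5543228) ≈ 4.73, so smokers' expected charges are roughly 4.7 times those of otherwise comparable non-smokers.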

28.7 Conclusion

Our multiple linear regression model suggests that age, sex, BMI, number of children, smoking status, and region are significant predictors of log-transformed insurance charges.

Specifically, older age, higher BMI, more children, and being a smoker are associated with higher log-transformed insurance charges. In contrast, being male and residing in the northwest, southeast, or southwest regions tends to be associated with lower log-transformed insurance charges compared to their respective reference categories.