37  Introduction to Structural Equation Modeling

Author
Affiliations

Ekpereka Sandra Nawfal

Florida International University

Robert Stempel College of Public Health and Social Work

Published

July 30, 2024

37.1 What is SEM?

Structural Equation Modeling (SEM) is a robust statistical approach that allows researchers to examine complex relationships between variables. It provides a theoretical framework that effectively combines several statistical techniques, including regression analysis, path analysis, and factor analysis, to test hypotheses about complex relationships in the data.

37.1.1 Advantages of using SEM

  • Simultaneous Analysis: SEM can analyze multiple variables and their relationships simultaneously, unlike traditional regression, which typically handles one dependent variable at a time.
  • Latent Variables: SEM allows the inclusion of latent variables in the analysis. Testing hypotheses with the latent variables, rather than with observed sum-scores of the indicators, frees the analysis from measurement error, because the errors are estimated and removed. This lets SEM test hypotheses directly at the construct level.
  • Flexibility: It can model complex and multidimensional relationships, including mediation and moderation effects.
  • Improved statistical estimation, because random error is removed or reduced.

37.1.2 Terminology in SEM

Before we proceed, let’s briefly define some of the terms typically used in the SEM context.

  • observed/measured/indicator/manifest variable: a variable that exists in the data, for example, the questionnaire items used to measure a latent variable.
  • latent/factor variable: a variable that is not directly measured. It is a constructed variable and does not exist in the data.
  • exogenous variable: the independent variable; an observed or latent variable that influences the endogenous variable.
  • endogenous variable: analogous to the dependent variable in traditional analyses, representing an outcome or effect; an observed or latent variable that has a causal path leading to it.
  • Disturbances: the residual (error) terms of the endogenous variables.

37.2 Fundamental Mathematical Components of SEM Model

  1. Measurement Model: The measurement model is the part of SEM that specifies the relationships between the indicator/measured/observed variables and the latent/unobserved constructs. It is typically validated using Confirmatory Factor Analysis (CFA) and is the basis for evaluating how well the indicators measure the underlying constructs.
  2. Structural Model: The structural model specifies the causal relationships between the latent variables or other variables. It represents the hypothesized directions of influence and the strength of these relationships. In lavaan, the two parts are written with different operators, as sketched after this list.
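The sketch below shows how these two components appear in lavaan model syntax: the operator =~ (read "is measured by") defines a measurement model, and ~ defines a structural regression. The construct and indicator names here are generic placeholders; the full model for our data is specified in Section 37.6.

# lavaan model syntax sketch (generic placeholder names)
sketch_model <- "
  # measurement model: each latent construct is measured by its observed indicators
  FactorX =~ x1 + x2 + x3
  FactorY =~ y1 + y2 + y3
  # structural model: regression of the endogenous latent variable on the exogenous one
  FactorY ~ FactorX
"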

37.2.1 Exogenous measurement variable:

Exogenous measurement equation: \[ \mathbf{x} = \mathbf{\tau_x} + \mathbf{\Lambda_x}\mathbf{\xi} + \mathbf{\delta} \]

  • \(\mathbf{x} =(x_1, \cdots, x_q)’\): Vector of exogenous indicators.
  • \(\mathbf{\tau_x}\): vector of intercepts for exogenous indicators
  • \(\mathbf{\Lambda_x}\): Matrix of factor loadings corresponding to the latent exogenous variables.
  • \(\mathbf{\xi}\): Vector of latent exogenous variables.
  • \(\mathbf{\delta}= ( \delta_1, \cdots, \delta_q)’\): Vector of residuals for exogenous indicators.
  • \(\mathbf{\theta_{\delta}}\): Variance/covariance matrix of residuals for exogenous indicators.

37.2.2 Endogenous measurement variable:

Endogenous measurement equation: \[ \mathbf{y} = \mathbf{\tau_y} + \mathbf{\Lambda_y}\mathbf{\eta} + \mathbf{\epsilon} \]

  • \(\mathbf{y} = (y_1, \cdots, y_p)’\): Vector of endogenous indicators.
  • \(\mathbf{\tau_y}\): vector of intercepts for endogenous indicators
  • \(\mathbf{\Lambda_y}\): Matrix of factor loadings corresponding to the latent endogenous variables.
  • \(\mathbf{\eta}\): Vector of latent endogenous variables.
  • \(\mathbf{\epsilon}= ( \epsilon_1, \cdots,\epsilon_p)’\): Vector of residuals for endogenous variables.
  • \(\mathbf{\theta_{\epsilon}}\): Variance/covariance matrix of residuals for endogenous indicators.

37.2.3 Structural equation definition

Structural equation: \[ \mathbf{\eta} = \mathbf{\alpha} + B\mathbf{\eta} + \Gamma\mathbf{\xi} + \mathbf{\zeta} \]

  • \(\mathbf{\alpha}\): a vector of intercepts.
  • \(B\): Matrix of regression coefficients for the endogenous variables.
  • \(\Gamma\): Matrix of regression coefficients for the exogenous variables.
  • \(\mathbf{\xi}\): Vector of latent exogenous variables.
  • \(\zeta= ( \zeta_1, \cdots, \zeta_m)’\): Vector of disturbances (residuals) for endogenous variables.

Note: The structural equations relate the latent variables defined by the measurement models, linking the two parts of the SEM.

37.3 Assumptions in SEM

  • Linearity: SEM assumes that the relationships between endogenous (dependent) and exogenous (independent) variables are linear. This can be assessed using scatter plots and residual analysis.
  • Normality: Indicators for latent variables are assumed to follow a normal distribution. This can be checked by estimating skewness and kurtosis (skewness ≤ 2 and kurtosis ≤ 7 are commonly used cutoffs); see the sketch after this list.
  • Multicollinearity: SEM assumes that predictors are not perfectly correlated. Multicollinearity among the indicators can be assessed using the Variance Inflation Factor (VIF).
  • Missing Data: Variables in the study should be complete; examine and handle missing data before fitting the model.
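A minimal base-R sketch of the normality and missing-data checks is shown below. It assumes the data frame assum_dat of indicator items that is created in Section 37.6; the MVN package used later reports the same kind of descriptives.

# per-item skewness, excess kurtosis, and missing counts (assumes assum_dat from Section 37.6)
item_checks <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
}
round(t(sapply(assum_dat, item_checks)), 2)
colSums(is.na(assum_dat))   # missing values per item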

37.4 Installing required packages

First, I begin the SEM analysis by installing and loading the R packages that will be used in this example.

#install.packages("lavaan")    # to estimate the SEM model
#install.packages("semPlot")   # for plotting the path diagram
#install.packages("seminr")    # contains the dataset for the analysis
#install.packages("car")       # to compute Variance Inflation Factors (VIF) for the multicollinearity check
#install.packages("MVN")       # for the multivariate normality check
#install.packages("tidyverse") # data manipulation (select(), %>%)
library(lavaan)
library(seminr)
library(semPlot)
library(ggplot2) 
library(MVN)
library(car)
library(tidyverse)

37.5 Data source and description

For our demonstration, we will use the mobi data frame from the seminr package. The dataset serves as the measurement instrument of the European Customer Satisfaction Index (ECSI), adapted to the mobile phone market. The data contain 250 observations on 24 indicator (measured) variables. Our research question: Does customers’ perceived quality of services impact their satisfaction with the product?

The variables of interest for the analysis are:

Exogenous variable: Perceived Quality, measured with seven survey items.

  • PERQ1 - Overall perceived quality
  • PERQ2 - Technical quality of the network
  • PERQ3 - Customer service and personal advice offered
  • PERQ4 - Quality of the services you use
  • PERQ5 - Range of services and products offered
  • PERQ6 - Reliability and accuracy of the products and services provided
  • PERQ7 - Clarity and transparency of information provided

Endogenous variable: Satisfaction with the phone service provider, measured with three survey items.

  • CUSA1 - Overall satisfaction
  • CUSA2 - Fulfillment of expectations
  • CUSA3 - How well do you think "your mobile phone provider" compares with your ideal mobile phone provider?

37.5.1 Conceptual diagram of the demonstration SEM model

To further illustrate the relationships among the variables included in this SEM demonstration, a conceptual diagram of the model is shown below:

Fig. 1: Observed/indicator/item/measured variables are depicted by rectangles; latent/unobserved variables are represented by ovals; exogenous variables send one-headed arrows to other variables, while endogenous variables receive one-headed arrows from other variables; variances are depicted with a two-headed arrow from a variable to itself; and covariances (not specified in this diagram) are depicted with a two-headed arrow from one variable to another.

37.5.2 Steps in SEM

  1. Model Specification: Define the theoretical model to be tested, that is, the hypothesized relationships between the variables (Fig. 1).
  2. Model Identification: Check whether there is enough information in the available data to estimate the parameters. An identified model is one that is estimable; the analysis will not run if the model is not identified. The model’s degrees of freedom tell us whether the model is under-identified, exactly/just-identified, or over-identified.
  3. Parameter Estimation: Estimate the parameters by comparing the observed and model-implied covariance matrices (e.g., using maximum likelihood estimation).
  4. Model Evaluation: Assess how well the model fits the data using various goodness-of-fit indices (e.g., the chi-square test, Root Mean Square Error of Approximation (RMSEA), Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), etc.).

37.5.2.1 Notes on Model Identification

The goal is for the model to be over-identified, that is, to have positive degrees of freedom (df):

  • degrees of freedom \(< 0\) (under-identified, bad)
  • degrees of freedom \(= 0\) (just identified or saturated, neither bad nor good)
  • degrees of freedom \(> 0\) (over-identified, good)

The model degrees of freedom (df) are calculated using the formula: \[ \mbox{df} = \mbox{number of known values} - \mbox{number of free parameters} \] Where:

  • \(p\) = number of observed variables (items in survey)
  • \(p(p+1)/2\) is the number of known values
  • \(\mbox{number of free parameters}\) = number of (unique) model parameters minus the number of fixed parameters

For example, to calculate the degrees of freedom of our SEM model:

  • number of observed variables (\(p\)) = 10
  • number of free parameters = \(23 - 2 = 21\)

Therefore, \[ \mbox{df} = 10(10+1)/2 - 21 = 55 - 21 = 34. \] Since \(\mbox{df} = 34 > 0\), our model is over-identified.
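The same calculation can be reproduced in a few lines of R. After the model is fitted in Section 37.6, lavaan’s fitMeasures() function reports the degrees of freedom of the fitted object (fit_sem), which can be compared against the hand calculation:

# degrees of freedom by hand
p <- 10                   # number of observed indicators
known <- p * (p + 1) / 2  # unique variances and covariances: 55
free <- 21                # freely estimated model parameters
known - free              # df = 34

# after fitting the model in Section 37.6:
#fitMeasures(fit_sem, "df")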

37.5.2.2 Criteria for Model Fit Evaluation

Cutoff criteria of the common fit indices:

  • Model chi-square (\(\chi^2\)): We want to observe a non-significant chi-square (\(p > .05\)), which indicates good fit. However, the model chi-square is highly sensitive to sample size, so it is often not considered a reasonable measure of fit on its own, especially with large samples.
  • CFI and TLI: values greater than 0.90 (conservatively, 0.95) indicate good fit.
  • RMSEA:
    • \(\le 0.05\): good fit
    • 0.05 - 0.08: reasonable approximate fit
    • 0.08 - 0.10: mediocre fit
    • \(\ge 0.10\): poor fit
  • Model Modification: Post hoc modification indices suggest ways to adjust the model, if necessary, based on the fit indices and theoretical considerations (a sketch for extracting fit and modification indices follows this list).
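Once the model has been fitted (fit_sem in Section 37.6), these indices and the modification indices can be extracted directly from the lavaan object; a minimal sketch:

# extract the common fit indices from the fitted lavaan object (fit_sem, Section 37.6)
#fitMeasures(fit_sem, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

# modification indices: parameters whose addition would most improve model fit
#modindices(fit_sem)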

37.6 Performing the SEM

#| label: datasets 
#| message: false
#| warning: false

# retrieve the dataset from the seminr package
data("mobi")

# View documentation
#help(mobi)

#colnames(mobi)
#head(mobi)

#create a new dataset that contains only the variables needed for our analysis 

assum_dat <- mobi %>% 
  select(PERQ1:PERQ7, CUSA1:CUSA3)

# Check for missing values in the entire data frame
is_na <- is.na(assum_dat)
missing_per_column <- colSums(is_na)
print(missing_per_column)
PERQ1 PERQ2 PERQ3 PERQ4 PERQ5 PERQ6 PERQ7 CUSA1 CUSA2 CUSA3 
    0     0     0     0     0     0     0     0     0     0 
# Specifying the model
model1.fit <- "
# measurement model
Quality =~ PERQ1 + PERQ2 + PERQ3 + PERQ4 + PERQ5 + PERQ6 + PERQ7
Satisfaction =~ CUSA1 + CUSA2 + CUSA3 

#structural model
Satisfaction ~ Quality
"
# model estimation
 fit_sem <- sem(
   model1.fit, 
   data=assum_dat
 )

#use the summary function to produce the result summaries
 summary(fit_sem, 
         fit.measures = T, #include the fit indices
         standardized=TRUE
 )
lavaan 0.6.17 ended normally after 26 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           250

Model Test User Model:
                                                      
  Test statistic                               105.123
  Degrees of freedom                                34
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                              1286.661
  Degrees of freedom                                45
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.943
  Tucker-Lewis Index (TLI)                       0.924

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -4176.834
  Loglikelihood unrestricted model (H1)      -4124.272
                                                      
  Akaike (AIC)                                8395.668
  Bayesian (BIC)                              8469.619
  Sample-size adjusted Bayesian (SABIC)       8403.047

Root Mean Square Error of Approximation:

  RMSEA                                          0.091
  90 Percent confidence interval - lower         0.072
  90 Percent confidence interval - upper         0.112
  P-value H_0: RMSEA <= 0.050                    0.000
  P-value H_0: RMSEA >= 0.080                    0.840

Standardized Root Mean Square Residual:

  SRMR                                           0.045

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Quality =~                                                            
    PERQ1             1.000                               1.093    0.770
    PERQ2             0.995    0.109    9.103    0.000    1.087    0.576
    PERQ3             1.244    0.102   12.209    0.000    1.359    0.747
    PERQ4             1.084    0.093   11.670    0.000    1.185    0.719
    PERQ5             0.905    0.082   10.983    0.000    0.989    0.682
    PERQ6             1.054    0.092   11.468    0.000    1.151    0.708
    PERQ7             1.278    0.103   12.429    0.000    1.396    0.759
  Satisfaction =~                                                       
    CUSA1             1.000                               0.865    0.702
    CUSA2             1.548    0.141   10.983    0.000    1.339    0.760
    CUSA3             1.510    0.139   10.839    0.000    1.306    0.749

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Satisfaction ~                                                        
    Quality           0.759    0.070   10.838    0.000    0.959    0.959

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .PERQ1             0.819    0.086    9.542    0.000    0.819    0.407
   .PERQ2             2.381    0.224   10.628    0.000    2.381    0.668
   .PERQ3             1.460    0.150    9.761    0.000    1.460    0.442
   .PERQ4             1.314    0.132    9.984    0.000    1.314    0.484
   .PERQ5             1.126    0.110   10.211    0.000    1.126    0.535
   .PERQ6             1.320    0.131   10.056    0.000    1.320    0.499
   .PERQ7             1.436    0.149    9.655    0.000    1.436    0.424
   .CUSA1             0.768    0.079    9.669    0.000    0.768    0.507
   .CUSA2             1.312    0.146    8.962    0.000    1.312    0.423
   .CUSA3             1.335    0.146    9.126    0.000    1.335    0.439
    Quality           1.194    0.170    7.004    0.000    1.000    1.000
   .Satisfaction      0.060    0.035    1.691    0.091    0.080    0.080

37.7 Check Assumptions

#| label: assumptions 
#| message: false
#| warning: false

# (1) Linearity:

#First, we extract the factor scores of the latent variables  

 fac_scores <- data.frame(
   lavPredict(fit_sem)
 )

#Plot the scatter plot using the extracted factor scores 

 with(
   fac_scores,
   plot(Quality,Satisfaction)
 )

#test linearity using ggplot

  ggplot(fac_scores, 
         aes(Quality, Satisfaction)) + 
    geom_point(size = 1) +
    geom_smooth(method = "loess")

# (2) Check multivariate normality 

 mvn(assum_dat, 
     mvnTest = "mardia")
$multivariateNormality
             Test        Statistic               p value Result
1 Mardia Skewness 1208.97674897777 3.73215048046629e-136     NO
2 Mardia Kurtosis 39.0584746716299                     0     NO
3             MVN             <NA>                  <NA>     NO

$univariateNormality
               Test  Variable Statistic   p value Normality
1  Anderson-Darling   PERQ1      7.1534  <0.001      NO    
2  Anderson-Darling   PERQ2      6.2090  <0.001      NO    
3  Anderson-Darling   PERQ3      6.6901  <0.001      NO    
4  Anderson-Darling   PERQ4      8.0496  <0.001      NO    
5  Anderson-Darling   PERQ5      6.7105  <0.001      NO    
6  Anderson-Darling   PERQ6      6.6768  <0.001      NO    
7  Anderson-Darling   PERQ7      6.7429  <0.001      NO    
8  Anderson-Darling   CUSA1     10.3268  <0.001      NO    
9  Anderson-Darling   CUSA2      4.7763  <0.001      NO    
10 Anderson-Darling   CUSA3      5.7320  <0.001      NO    

$Descriptives
        n  Mean  Std.Dev Median Min Max 25th 75th       Skew  Kurtosis
PERQ1 250 7.944 1.421600      8   2  10    7    9 -0.7203890 1.1786363
PERQ2 250 7.192 1.891414      7   1  10    6    8 -0.8186657 0.7287376
PERQ3 250 7.700 1.821888      8   1  10    7    9 -0.9064463 0.8310284
PERQ4 250 7.916 1.651622      8   1  10    7    9 -1.0704382 1.7901560
PERQ5 250 7.872 1.453294      8   3  10    7    9 -0.6770313 0.5583053
PERQ6 250 7.776 1.629862      8   1  10    7    9 -0.9743906 1.7870540
PERQ7 250 7.592 1.843674      8   1  10    7    9 -0.9593571 1.0528448
CUSA1 250 7.988 1.233671      8   4  10    7    9 -0.2201919 0.1527890
CUSA2 250 7.128 1.765242      7   1  10    6    8 -0.5471886 0.5579660
CUSA3 250 7.316 1.747099      7   1  10    7    8 -0.6671235 0.9615690
# (3) Check multicollinearity: regress one indicator on the others and compute VIFs

model_vif <- lm(PERQ1 ~ PERQ2 + PERQ3 + PERQ4 + PERQ5 + PERQ6 + PERQ7 + CUSA1 + CUSA2 + CUSA3, data = assum_dat) 

vif(model_vif)
   PERQ2    PERQ3    PERQ4    PERQ5    PERQ6    PERQ7    CUSA1    CUSA2 
1.486854 2.227753 2.130806 1.794253 1.936840 2.362554 1.741678 2.045506 
   CUSA3 
2.123181 
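All VIF values are well below the commonly used rule-of-thumb cutoff of 5 (some authors use 10), so multicollinearity does not appear to be a concern here. As a complementary informal check, the indicator correlation matrix can be inspected directly; a minimal sketch:

# correlation matrix of the indicator items (values close to 1 would signal redundancy)
round(cor(assum_dat), 2)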

37.8 Visualization of the SEM path diagram

A path diagram provides a graphical representation of the structural relationships, including causality, variances, and covariances, among the observed and latent variables in our SEM model.

# Plot SEM path diagram using semPaths in semPlot package
 semPaths(fit_sem, 
    what = "par",  #display edges in the path diagram as weighted.
    whatLabels = "par", #display the unstandardized parameter coefficient
    rotation = 2, 
    edge.label.cex = 0.7,
    fontname ="Helvetica",
 )
 mtext("Fig.2", side = 1, line = 4, at = 0.5, cex = 1) #Add caption 

Fig. 2: The structural equation model for the effect of customers’ perceived quality of service on customer satisfaction. Arrows represent the hypothesized causal relationships between the exogenous and endogenous latent variables. The arrow width indicates the strength of the relationship. The values next to the arrows are path coefficients (unstandardized regression coefficients). The dashed paths indicate that the first factor loading of each latent variable is fixed to 1 (the lavaan model syntax does this by default to set the scale of the latent variable).

# Plot using semPaths in semPlot package
 semPaths(fit_sem, 
          what = "path", #display edges in the path diagram as unweighted.
          whatLabels = "par", 
          rotation = 2, 
          edge.label.cex = 0.7,
          fontname ="Helvetica", 
          edgeOptions = list(color = "black")
 ) 
 mtext("Fig.3", side = 1, line = 4, at = 0.5, cex = 1)

 #semPaths(fit_sem, 
          #what = "std", #display edges in the path diagram as weighted.
          #whatLabels = "std", #display the standardized parameter coefficient
          #rotation = 2,
          #edge.label.cex = 0.7,
          #fontname ="Helvetica",
 #)
 #mtext("Fig.4", side = 1, line = 4, at = 0.5, cex = 1)

37.9 Interpretation

  • Model fit: Although the model chi-square was significant (p < .05) and the RMSEA = 0.091, the CFI = 0.943 and TLI = 0.924 suggest the model fits the data reasonably well.
  • The structural path is statistically significant (p < .001). The finding shows that customers’ perceived quality of services positively predicts satisfaction: for every one-unit increase in the perceived quality score, satisfaction is predicted to increase by 0.76 points (unstandardized estimate = 0.759; standardized estimate = 0.959). The estimates can be extracted as shown in the sketch below.
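To report the standardized coefficient alongside the unstandardized one, the estimates can be pulled from the fitted object; a minimal sketch using lavaan’s parameter table:

# unstandardized and standardized estimates for the structural (regression) path only
pe <- parameterEstimates(fit_sem, standardized = TRUE)
subset(pe, op == "~", select = c(lhs, op, rhs, est, se, pvalue, std.all))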

37.10 Conclusion

  • In summary, we have used SEM, a powerful tool for testing theoretical models, to explore the data and gain a deeper understanding of the complex relationships between variables. SEM requires careful specification, estimation, and evaluation to ensure accurate results. It is useful in various fields such as psychology, sociology, education, and business.

37.11 References

  1. Introduction to Structural Equation Modeling (SEM) in R Lavaan. https://stats.oarc.ucla.edu/r/seminars/rsem/#

  2. Liu, X., Swenson, N. G., Lin, D., Mi, X., Umaña, M. N., Schmid, B., & Ma, K. (2016). Linking individual‐level functional traits to tree growth in a subtropical forest. Ecology, 97(9), 2396-2405. https://doi.org/10.1002/ecy.1445

  3. Donaldson, L. (2015). The First Generation: Definition and Brief History of Structural Equation Modeling. Journal of Administrative Sciences, 12, 182-94. https://www.stats.ox.ac.uk/~snijders/Encyclopedia_SEM_Kaplan.pdf

  4. Structural Equation Modeling https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/structural-equation-modeling/

  5. Soumya Ray & Nicholas Danks. SEMinR. https://cran.r-project.org/web/packages/seminr/vignettes/SEMinR.html#data

  6. Ullman, J. B., & Bentler, P. M. (2012). Structural equation modeling. Handbook of Psychology, Second Edition, 2. https://doi.org/10.1002/9781118133880.hop202023