Generate data from a lavaan model syntax

Generate data based on the parameters of a structural equation model in lavaan model syntax.

generateData(
 .model                    = NULL,
 .empirical                = FALSE,
 .handle_negative_definite = c("stop", "drop", "set_NA"),
 .return_type              = c("data.frame", "matrix", "cor"),
 .N                        = 200,
 .skewness                 = NULL,
 .kurtosis                 = NULL,
 ...
 )

Arguments

.model	A model in lavaan model syntax.
.empirical	Logical. If `TRUE`, mu and Sigma of the normal distribution specify the empirical rather than the population mean and covariance matrix. Ignored if `.return_type = "cor"`. Defaults to `FALSE`.
.handle_negative_definite	Character string. How should negative definite indicator correlation matrices be handled? One of `"stop"`, `"drop"`, or `"set_NA"`, in which case an `NA` is produced. Defaults to `"stop"`.
.return_type	Character string. One of `"data.frame"`, `"matrix"`, or `"cor"`, in which case the indicator correlation matrix is returned. Defaults to `"data.frame"`.
.N	Integer. The number of observations to generate. Ignored if `.return_type = "cor"`. Defaults to `200`.
.skewness	List. List of predefined values for the skewness of the indicators.
.kurtosis	List. List of predefined values for the kurtosis of the indicators.
...	`"name" = vector_of_values` pairs. `"name"` is a character string giving the label used for the parameter of interest. `vector_of_values` is a numeric vector of values to use for the parameter given by `"name"`.

Value

The generated data: a data.frame (.return_type = "data.frame"), a numeric matrix (.return_type = "matrix"), or a correlation matrix (.return_type = "cor"). If variable parameters have been set, a nested tibble is returned.

Details

Generate data for structural equation models including up to 8 constructs if a structural model is given or an unlimited number if only the correlation between constructs is needed. To be precise, if users specify a structural model we support a maximum of 5 exogenous constructs. Depending on the number of exogenous constructs the following number of endogenous constructs is allowed:

If there is 1 exogenous construct : a maximum of 7 endogenous constructs is allowed
If there are 2 exogenous constructs: a maximum of 6 endogenous constructs is allowed
If there are 3 exogenous constructs: a maximum of 5 endogenous constructs is allowed
If there are 4 exogenous constructs: a maximum of 4 endogenous constructs is allowed
If there are 5 exogenous constructs: a maximum of 4 endogenous constructs is allowed

The reason for the limitation is that data is generated such that the model-implied variances of the constructs are always unity. The model-implied construct covariance matrix is a complex function of the structural residual variances, which are in turn a complex function of the path coefficients, so the equation for each construct variance grows massively with each additional construct. Because the number of possible model specifications also grows rapidly with the number of constructs, we solved the variance equations symbolically as a function of the path coefficients in Mathematica. With more than 8 constructs, the size of these symbolic representations becomes computationally infeasible.
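As an illustration of the unit-variance constraint, the residual variances for the first model in the Examples (eta2 ~ 0.4*eta1 and eta3 ~ 0.4*eta1 + 0.35*eta2) can be worked out by hand. The notation is an assumption made for this sketch: zeta2 and zeta3 denote the structural residuals, and eta1 has unit variance.

```latex
\begin{align*}
\operatorname{Var}(\eta_2) &= 0.4^2 + \operatorname{Var}(\zeta_2) = 1
  \;\Rightarrow\; \operatorname{Var}(\zeta_2) = 0.84 \\
\operatorname{Cov}(\eta_1, \eta_2) &= 0.4 \\
\operatorname{Var}(\eta_3) &= 0.4^2 + 0.35^2
  + 2 \cdot 0.4 \cdot 0.35 \cdot \operatorname{Cov}(\eta_1, \eta_2)
  + \operatorname{Var}(\zeta_3) = 1 \\
  &\Rightarrow\; \operatorname{Var}(\zeta_3) = 1 - 0.16 - 0.1225 - 0.112 = 0.6055
\end{align*}
```

Each further endogenous construct adds covariance terms with all of its predecessors, which is why the closed-form expressions grow so quickly.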

Generation is based on parameter values given in lavaan model syntax. Currently, linear models and models containing second order constructs are supported. Supplying a model containing nonlinear terms causes an error.

For the structural model equations (~) values are interpreted as path coefficients. For measurement model equations values are taken to be loadings if the concept is modeled as a common factor (=~). If the concept is modeled as a composite (<~), values are interpreted as (unscaled) weights. In the latter case, indicators are allowed to be arbitrarily correlated. Hence, the correlation between indicators needs to be set as well. Indicator correlations, measurement error correlations, and correlations between exogenous constructs are set using the (~~) operator. Note that when writing, for instance, x1 ~~ 0.2*x2 (where x1 and x2 are indicators of some construct eta1), the interpretation depends on whether eta1 is modeled as a composite or a common factor. In the former case x1 ~~ 0.2*x2 is a correlation between indicators; in the latter case it is interpreted as a measurement error correlation.
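To make the meaning of "(unscaled) weights" concrete, the following base R sketch rescales the weights of eta2 from the second order example (eta2 <~ 2*y51 + 3*y52 + 5*y53) so that the composite has unit variance, and derives the implied loadings. The object names are illustrative, and this is only one way to do the computation, not necessarily how generateData() does it internally.

```r
# Unscaled weights as written in the model syntax
w <- c(y51 = 2, y52 = 3, y53 = 5)

# Within-block indicator correlations set via `~~`
S <- matrix(c(1.0, 0.2, 0.3,
              0.2, 1.0, 0.4,
              0.3, 0.4, 1.0),
            nrow = 3, dimnames = list(names(w), names(w)))

# Rescale so that the composite w'y has unit variance
w_scaled <- w / sqrt(drop(t(w) %*% S %*% w))

# Implied loadings: correlation of each indicator with the composite
loadings <- drop(S %*% w_scaled)

round(w_scaled, 4)
round(loadings, 4)
```

The rounded results (weights 0.2617, 0.3926, 0.6543 and loadings 0.5365, 0.7066, 0.8898) agree with the estimates for eta2 in the cSEM output shown in the Examples.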

In addition to supplying numeric values, variable values for parameters are allowed. To achieve this, the package makes use of lavaan's labeling capabilities. Users may replace a given parameter in, e.g., the structural model by a symbolic name and assign a vector of values to that name by passing a "name" = vector_of_values argument to generateData(). Data is then generated for every possible combination of these values, with the remaining parameters held fixed.
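The number of data sets produced is the product of the lengths of the supplied vectors. A small base R sketch of that bookkeeping, using the labels gamma and epsilon from the Examples (expand.grid() is used here purely for illustration and is not claimed to be what the package uses):

```r
# Candidate values for two labeled parameters
gamma   <- c(-0.4, -0.2, 0, 0.2, 0.4)
epsilon <- c(0.1, 0.2, 0.3)

# One row per combination; the first variable varies fastest
grid <- expand.grid(gamma = gamma, epsilon = epsilon)

nrow(grid)     # 5 * 3 = 15 combinations
head(grid, 6)
```

Each row corresponds to one data generating process, which matches the 15 rows of the nested tibble in the Examples.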

If .return_type is "data.frame" or "matrix", normally distributed data is generated with zero mean and a variance-covariance matrix equal to the population indicator correlation matrix, i.e., the matrix that would be returned if .return_type = "cor".
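The sketch below rebuilds the population indicator correlation matrix of the first example from its loadings and construct correlations, then draws normal data from it. It assumes MASS::mvrnorm() as the sampler; whether generateData() calls it internally is an assumption, although the description of .empirical mirrors mvrnorm()'s empirical argument.

```r
library(MASS)  # provides mvrnorm()

# Loadings of the first example (three common factors, three indicators each)
ind <- c("y11", "y12", "y13", "y21", "y22", "y23", "y31", "y32", "y33")
Lambda <- matrix(0, nrow = 9, ncol = 3,
                 dimnames = list(ind, c("eta1", "eta2", "eta3")))
Lambda[1:3, "eta1"] <- c(0.8, 0.9, 0.8)
Lambda[4:6, "eta2"] <- c(0.7, 0.7, 0.9)
Lambda[7:9, "eta3"] <- c(0.9, 0.8, 0.7)

# Construct correlations implied by the path coefficients:
# cor(eta1, eta3) = 0.4 + 0.35 * 0.4 = 0.54
# cor(eta2, eta3) = 0.35 + 0.4 * 0.4 = 0.51
Phi <- matrix(c(1.00, 0.40, 0.54,
                0.40, 1.00, 0.51,
                0.54, 0.51, 1.00), nrow = 3)

# Population indicator correlation matrix; the diagonal is one because
# the measurement error variances equal 1 - loading^2
Sigma <- Lambda %*% Phi %*% t(Lambda)
diag(Sigma) <- 1

set.seed(123)
X_pop <- mvrnorm(n = 200, mu = rep(0, 9), Sigma = Sigma)
X_emp <- mvrnorm(n = 200, mu = rep(0, 9), Sigma = Sigma, empirical = TRUE)

max(abs(cor(X_pop) - Sigma))  # sampling error remains
max(abs(cor(X_emp) - Sigma))  # numerically zero
```

With empirical = TRUE the sample moments reproduce Sigma, which is why estimates obtained from such data can coincide with the parameters of the data generating process.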

Examples

# ==============================================================================
# Without variable parameters
# ==============================================================================
## DGP with constructs modeled as common factors
dgp <- "
# Structural model
eta2 ~ 0.4*eta1
eta3 ~ 0.4*eta1 + 0.35*eta2

# Measurement model
eta1 =~ 0.8*y11 + 0.9*y12 + 0.8*y13
eta2 =~ 0.7*y21 + 0.7*y22 + 0.9*y23
eta3 =~ 0.9*y31 + 0.8*y32 + 0.7*y33
"

dat <- generateData(dgp, .return_type = "cor")
dat
#>        y11    y12    y13    y21    y22    y23    y31    y32    y33
#> y11 1.0000 0.7200 0.6400 0.2240 0.2240 0.2880 0.3888 0.3456 0.3024
#> y12 0.7200 1.0000 0.7200 0.2520 0.2520 0.3240 0.4374 0.3888 0.3402
#> y13 0.6400 0.7200 1.0000 0.2240 0.2240 0.2880 0.3888 0.3456 0.3024
#> y21 0.2240 0.2520 0.2240 1.0000 0.4900 0.6300 0.3213 0.2856 0.2499
#> y22 0.2240 0.2520 0.2240 0.4900 1.0000 0.6300 0.3213 0.2856 0.2499
#> y23 0.2880 0.3240 0.2880 0.6300 0.6300 1.0000 0.4131 0.3672 0.3213
#> y31 0.3888 0.4374 0.3888 0.3213 0.3213 0.4131 1.0000 0.7200 0.6300
#> y32 0.3456 0.3888 0.3456 0.2856 0.2856 0.3672 0.7200 1.0000 0.5600
#> y33 0.3024 0.3402 0.3024 0.2499 0.2499 0.3213 0.6300 0.5600 1.0000

## DGP with a construct modeled as a composite
# If the model contains composites, within-block indicator correlation
# needs to be set as well.
dgp <- "
# Structural model
eta2 ~ 0.2*eta1
eta3 ~ 0.4*eta1 + 0.35*eta2

# Measurement model
eta1 <~ 0.7*y11 + 0.9*y12 + 0.8*y13
eta2 =~ 0.7*y21 + 0.7*y22 + 0.9*y23
eta3 =~ 0.9*y31 + 0.8*y32 + 0.7*y33

# Within block indicator correlation of eta1
y11 ~~ 0.2*y12
y11 ~~ 0.3*y13
y12 ~~ 0.5*y13
"

dat <- generateData(dgp, .return_type = "matrix")
dat[1:4, ]
#>            y11        y12         y13        y21        y22         y23
#> [1,] 0.2833699 -0.1695697  0.19485360 -0.2459982 1.30419075 -0.01983093
#> [2,] 0.2636589 -0.5632186 -0.30880616  1.1322802 0.71194015  0.53392818
#> [3,] 1.6549501  1.7640614  1.39770736  1.9406051 1.41488167  1.23776506
#> [4,] 1.1624829 -1.7732494 -0.06297272 -0.7619076 0.09713996 -0.26068743
#>             y31        y32         y33
#> [1,]  1.6381927  2.4202904 1.409257809
#> [2,] -0.6779495 -2.4443972 0.658444822
#> [3,]  2.1322319  1.6886823 0.628915071
#> [4,]  1.0220640  0.2920795 0.002221929

# ==============================================================================
# With variable parameters
# ==============================================================================
### Linear DGP -----------------------------------------------------------------
# Add a label to each variable parameter and assign values to each label
dgp <- "
# Structural model
eta2 ~ 0.2*eta1
eta3 ~ gamma*eta1 + 0.35*eta2

# Measurement model
eta1 <~ 0.7*y11 + 0.9*y12 + 0.8*y13
eta2 =~ 0.7*y21 + 0.7*y22 + 0.9*y23
eta3 =~ 0.9*y31 + 0.8*y32 + 0.7*y33

# Within block indicator correlation
y11 ~~ 0.2*y12
y11 ~~ 0.3*y13
y12 ~~ epsilon*y13
"

dat <- generateData(dgp,
                    "gamma" = c(-0.4, -0.2, 0, 0.2, 0.4),
                    "epsilon" = c(0.1, 0.2, 0.3), .return_type = "data.frame")
dat
#> # A tibble: 15 x 4
#>       Id gamma epsilon dgp               
#>    <int> <dbl>   <dbl> <list>            
#>  1     1  -0.4     0.1 <df[,9] [200 × 9]>
#>  2     2  -0.2     0.1 <df[,9] [200 × 9]>
#>  3     3   0       0.1 <df[,9] [200 × 9]>
#>  4     4   0.2     0.1 <df[,9] [200 × 9]>
#>  5     5   0.4     0.1 <df[,9] [200 × 9]>
#>  6     6  -0.4     0.2 <df[,9] [200 × 9]>
#>  7     7  -0.2     0.2 <df[,9] [200 × 9]>
#>  8     8   0       0.2 <df[,9] [200 × 9]>
#>  9     9   0.2     0.2 <df[,9] [200 × 9]>
#> 10    10   0.4     0.2 <df[,9] [200 × 9]>
#> 11    11  -0.4     0.3 <df[,9] [200 × 9]>
#> 12    12  -0.2     0.3 <df[,9] [200 × 9]>
#> 13    13   0       0.3 <df[,9] [200 × 9]>
#> 14    14   0.2     0.3 <df[,9] [200 × 9]>
#> 15    15   0.4     0.3 <df[,9] [200 × 9]>

### DGP containing a second order construct ------------------------------------
# Second order constructs are supported as well.
dgp_2ndorder <- "
## Path model / Regressions
eta2 ~ 0.5*eta1
eta3 ~ 0.35*eta1 + 0.4*eta2

## Composite model
eta1 <~ 0.8*y41 + 0.6*y42 + 0.6*y43
eta2 <~ 2*y51 + 3*y52 + 5*y53
c1   <~ 0.8*y11 + 0.4*y12
c2   <~ 0.5*y21 + 0.3*y22 + 0.2*y23 + 0.4*y24

## Higher order composite
eta3 <~ 0.4*c1 + 0.4*c2

## Composite indicator correlations
# eta1
y41 ~~ 0.5*y42
y41 ~~ 0.5*y43
y42 ~~ 0.5*y43

# eta2
y51 ~~ 0.2*y52
y51 ~~ 0.3*y53
y52 ~~ 0.4*y53

# eta3 (the 2nd order construct)
c1 ~~ 0.49*c2

# c1-c2
y11 ~~ 0.3125*y12

y21 ~~ 0.4*y22
y21 ~~ 0.3*y23
y21 ~~ 0.31*y24
y22 ~~ 0.28*y23
y22 ~~ 0.31*y24
y23 ~~ 0.3*y24
"

dat <- generateData(dgp_2ndorder, .return_type = "data.frame", .empirical = TRUE)
dat[1:5, ]
#>          y41        y42         y43        y11        y12        y21
#> 1  1.6501001  1.0591335  1.04291831  2.0596873  0.1926962  0.7920012
#> 2 -0.2852002  0.4285055  1.48444017  0.1242391  0.7418071  0.6104972
#> 3  0.4569223 -0.2404468  0.74690311  1.1072126  0.4188807  0.8521478
#> 4  0.1503388  0.5890999  0.04658271  1.0498477 -0.7476978  0.6602936
#> 5  1.2367091  0.7017904 -0.59327101 -0.3491474 -0.4072340 -0.6550728
#>           y22       y23        y24        y51        y52        y53
#> 1  0.07741937 0.9016692 0.05564948 -0.3327969 -0.6952362 -0.3423606
#> 2 -0.29478640 1.4400117 1.31828329  1.2679078  1.7665792  0.3374870
#> 3  0.53409022 1.3029529 1.17307243 -0.9371740 -0.4217491  0.5689144
#> 4  0.39212851 1.5824987 0.83750277  1.5860984  1.0826290  0.6578463
#> 5 -0.70035840 1.1851110 1.89050705  1.2829048 -1.6464734  0.4250665

## Estimate using cSEM
require(cSEM)
#> Loading required package: cSEM
#> 
#> Attaching package: ‘cSEM’
#> The following object is masked from ‘package:stats’:
#> 
#>     predict

aa <- cSEM::csem(dat, dgp_2ndorder)
cSEM::summarize(aa) ## parameter estimates are identical to the DGP
#> ________________________________________________________________________________
#> ----------------------------------- Overview -----------------------------------
#> 
#> 	General information:
#> 	------------------------
#> 	Estimation status                = Ok
#> 	Number of observations           = 200
#> 	Weight estimator                 = PLS-PM
#> 	Inner weighting scheme           = path
#> 	Type of indicator correlation    = Pearson
#> 	Path model estimator             = OLS
#> 	Second order approach            = 2stage
#> 	Type of path model               = Linear
#> 	Disattenuated                    = No
#> 
#> 	Construct details:
#> 	------------------
#> 	Name  Modeled as     Order         Mode 
#> 
#> 	eta1  Composite      First order   modeB
#> 	c1    Composite      First order   modeB
#> 	c2    Composite      First order   modeB
#> 	eta2  Composite      First order   modeB
#> 	eta3  Composite      Second order  modeB
#> 
#> ----------------------------------- Estimates ----------------------------------
#> 
#> Estimated path coefficients:
#> ============================
#>   Path           Estimate  Std. error   t-stat.   p-value
#>   eta2 ~ eta1      0.5000          NA        NA        NA
#>   eta3 ~ eta1      0.3500          NA        NA        NA
#>   eta3 ~ eta2      0.4000          NA        NA        NA
#> 
#> Estimated loadings:
#> ===================
#>   Loading        Estimate  Std. error   t-stat.   p-value
#>   eta1 =~ y41      0.8552          NA        NA        NA
#>   eta1 =~ y42      0.7941          NA        NA        NA
#>   eta1 =~ y43      0.7941          NA        NA        NA
#>   c1 =~ y11        0.9250          NA        NA        NA
#>   c1 =~ y12        0.6500          NA        NA        NA
#>   c2 =~ y21        0.8040          NA        NA        NA
#>   c2 =~ y22        0.6800          NA        NA        NA
#>   c2 =~ y23        0.5540          NA        NA        NA
#>   c2 =~ y24        0.7080          NA        NA        NA
#>   eta2 =~ y51      0.5365          NA        NA        NA
#>   eta2 =~ y52      0.7066          NA        NA        NA
#>   eta2 =~ y53      0.8898          NA        NA        NA
#>   eta3 =~ c1       0.8631          NA        NA        NA
#>   eta3 =~ c2       0.8631          NA        NA        NA
#> 
#> Estimated weights:
#> ==================
#>   Weights        Estimate  Std. error   t-stat.   p-value
#>   eta1 <~ y41      0.4887          NA        NA        NA
#>   eta1 <~ y42      0.3665          NA        NA        NA
#>   eta1 <~ y43      0.3665          NA        NA        NA
#>   c1 <~ y11        0.8000          NA        NA        NA
#>   c1 <~ y12        0.4000          NA        NA        NA
#>   c2 <~ y21        0.5000          NA        NA        NA
#>   c2 <~ y22        0.3000          NA        NA        NA
#>   c2 <~ y23        0.2000          NA        NA        NA
#>   c2 <~ y24        0.4000          NA        NA        NA
#>   eta2 <~ y51      0.2617          NA        NA        NA
#>   eta2 <~ y52      0.3926          NA        NA        NA
#>   eta2 <~ y53      0.6543          NA        NA        NA
#>   eta3 <~ c1       0.5793          NA        NA        NA
#>   eta3 <~ c2       0.5793          NA        NA        NA
#> 
#> Estimated measurement error correlations:
#> =========================================
#>   Correlation    Estimate  Std. error   t-stat.   p-value
#>   y41 ~~ y42      -0.1791          NA        NA        NA
#>   y41 ~~ y43      -0.1791          NA        NA        NA
#>   y42 ~~ y43      -0.1306          NA        NA        NA
#>   y11 ~~ y12      -0.2887          NA        NA        NA
#>   y21 ~~ y22      -0.1467          NA        NA        NA
#>   y21 ~~ y23      -0.1454          NA        NA        NA
#>   y21 ~~ y24      -0.2592          NA        NA        NA
#>   y22 ~~ y23      -0.0967          NA        NA        NA
#>   y22 ~~ y24      -0.1714          NA        NA        NA
#>   y23 ~~ y24      -0.0922          NA        NA        NA
#>   y51 ~~ y52      -0.1791          NA        NA        NA
#>   y51 ~~ y53      -0.1774          NA        NA        NA
#>   y52 ~~ y53      -0.2288          NA        NA        NA
#> 
#> Estimated indicator correlations:
#> =================================
#>   Correlation    Estimate  Std. error   t-stat.   p-value
#>   y41 ~~ y42       0.5000          NA        NA        NA
#>   y41 ~~ y43       0.5000          NA        NA        NA
#>   y42 ~~ y43       0.5000          NA        NA        NA
#>   y11 ~~ y12       0.3125          NA        NA        NA
#>   y21 ~~ y22       0.4000          NA        NA        NA
#>   y21 ~~ y23       0.3000          NA        NA        NA
#>   y21 ~~ y24       0.3100          NA        NA        NA
#>   y22 ~~ y23       0.2800          NA        NA        NA
#>   y22 ~~ y24       0.3100          NA        NA        NA
#>   y23 ~~ y24       0.3000          NA        NA        NA
#>   y51 ~~ y52       0.2000          NA        NA        NA
#>   y51 ~~ y53       0.3000          NA        NA        NA
#>   y52 ~~ y53       0.4000          NA        NA        NA
#> 
#> ------------------------------------ Effects -----------------------------------
#> 
#> Estimated total effects:
#> ========================
#>   Total effect    Estimate  Std. error   t-stat.   p-value
#>   eta2 ~ eta1       0.5000          NA        NA        NA
#>   eta3 ~ eta1       0.5500          NA        NA        NA
#>   eta3 ~ eta2       0.4000          NA        NA        NA
#> 
#> Estimated indirect effects:
#> ===========================
#>   Indirect effect    Estimate  Std. error   t-stat.   p-value
#>   eta3 ~ eta1          0.2000          NA        NA        NA
#> ________________________________________________________________________________
