The causalDisco causal discovery web tool uses five simulated datasets. These are the data that the code examples are run on. These five datasets are:
numData: A dataset of 5 numeric variables,X1,X2,X3,ZandY.catData: A dataset of 5 categorical variables,X1,X2,X3,ZandY, each taking the valuesa,b,c,d,e,f.mixData: A dataset of 5 variables, where the variablesX1,X2andX3are identical to those fromcatData, while the variablesZandYare identical to those fromnumData.catData_mcar: A modification ofcatDatawhere there has been introduced missing observations completely at random in some variables.X1has 10% missing values,X2has 5% missing values andX3has 20% missing values.numData_latent: A modification ofnumDatawhere the variableZis not included and thus can be considered latent.
The code for generating the datasets is included below. Please note that the datasets do not necessarily live up to all assumptions for the proceures presented here and is only intended as example data for the syntax of the functions – not for measuring performance.
The causal model for the first four datasets are represented in Figure 1 below.
Figure 1: True DAG for the datasets numData, catData, mixData and catData_mcar.
In the last dataset, numData_latent, the variable Z is unobserved and thus it can represented by a MAG obtained by replacing Z with a double-headed arrow from X1 to Y. We also provide this graph below in Figure 2.
Figure 2: True MAG for the dataset numData_latent.
The simulated datasets were generated in R using the following code:
############################################################################################
######################Simulate data to be used for causalDisco code examples################
############################################################################################
#All data sets use the numerical data as it offset. This dataset is simulated as
#a mix of linear and non-linear structural equations with additive noise. The noise
#components are Gaussian or uniform.
#The datasets are stored in data.frames and converted to other formats (e.g. matrix)
#in the code examples, if needed.
#sample size
n <- 1000
#Simulate numData
set.seed(123)
numData <- data.frame(Z = abs(rnorm(n, mean = 10)))
numData$X1 <- sqrt(numData$Z) + runif(n, min = 0, max = 2)
numData$X3 <- runif(n, min = 5, max = 10)
numData$X2 <- 2*numData$X3 - rnorm(n, mean = 5)
numData$Y <- numData$X1^2 + numData$X2 - numData$X3 - numData$Z + rnorm(n, mean = 10)
numData <- numData[, c("X1", "X2", "X3", "Z", "Y")]
#Make catData
catData <- as.data.frame(sapply(numData, function(x) cut(x, breaks = 5,
labels = letters[1:5])))
#Make mixData
mixData <- numData
mixData$X1 <- catData$X1
mixData$X2 <- catData$X2
mixData$X3 <- catData$X3
#Make catData_mcar
catData_mcar <- catData
set.seed(1234)
catData_mcar$X1[sample(1:n, 100)] <- NA
catData_mcar$X2[sample(1:n, 50)] <- NA
catData_mcar$X3[sample(1:n, 200)] <- NA
#Make numData_latent
numData_latent <- numData[, c("X1", "X2", "Z", "Y")]