The causalDisco causal discovery web tool uses five simulated datasets. These are the data that the code examples are run on. These five datasets are:
numData
: A dataset of 5 numeric variables,X1
,X2
,X3
,Z
andY
.catData
: A dataset of 5 categorical variables,X1
,X2
,X3
,Z
andY
, each taking the valuesa
,b
,c
,d
,e
,f
.mixData
: A dataset of 5 variables, where the variablesX1
,X2
andX3
are identical to those fromcatData
, while the variablesZ
andY
are identical to those fromnumData
.catData_mcar
: A modification ofcatData
where there has been introduced missing observations completely at random in some variables.X1
has 10% missing values,X2
has 5% missing values andX3
has 20% missing values.numData_latent
: A modification ofnumData
where the variableZ
is not included and thus can be considered latent.
The code for generating the datasets is included below. Please note that the datasets do not necessarily live up to all assumptions for the proceures presented here and is only intended as example data for the syntax of the functions – not for measuring performance.
The causal model for the first four datasets are represented in Figure 1 below.
In the last dataset, numData_latent
, the variable Z
is unobserved and thus it can represented by a MAG obtained by replacing Z
with a double-headed arrow from X1
to Y
. We also provide this graph below in Figure 2.
The simulated datasets were generated in R using the following code:
############################################################################################
######################Simulate data to be used for causalDisco code examples################
############################################################################################
#All data sets use the numerical data as it offset. This dataset is simulated as
#a mix of linear and non-linear structural equations with additive noise. The noise
#components are Gaussian or uniform.
#The datasets are stored in data.frames and converted to other formats (e.g. matrix)
#in the code examples, if needed.
#sample size
n <- 1000
#Simulate numData
set.seed(123)
numData <- data.frame(Z = abs(rnorm(n, mean = 10)))
numData$X1 <- sqrt(numData$Z) + runif(n, min = 0, max = 2)
numData$X3 <- runif(n, min = 5, max = 10)
numData$X2 <- 2*numData$X3 - rnorm(n, mean = 5)
numData$Y <- numData$X1^2 + numData$X2 - numData$X3 - numData$Z + rnorm(n, mean = 10)
numData <- numData[, c("X1", "X2", "X3", "Z", "Y")]
#Make catData
catData <- as.data.frame(sapply(numData, function(x) cut(x, breaks = 5,
labels = letters[1:5])))
#Make mixData
mixData <- numData
mixData$X1 <- catData$X1
mixData$X2 <- catData$X2
mixData$X3 <- catData$X3
#Make catData_mcar
catData_mcar <- catData
set.seed(1234)
catData_mcar$X1[sample(1:n, 100)] <- NA
catData_mcar$X2[sample(1:n, 50)] <- NA
catData_mcar$X3[sample(1:n, 200)] <- NA
#Make numData_latent
numData_latent <- numData[, c("X1", "X2", "Z", "Y")]