causalDisco

The causalDisco causal discovery web tool uses five simulated datasets, and all code examples are run on them. The five datasets are:

numData: A dataset of 5 numeric variables, X1, X2, X3, Z and Y.
catData: A dataset of 5 categorical variables, X1, X2, X3, Z and Y, each taking the values a, b, c, d, e.
mixData: A dataset of 5 variables, where the variables X1, X2 and X3 are identical to those from catData, while the variables Z and Y are identical to those from numData.
catData_mcar: A modification of catData in which observations have been set to missing completely at random (MCAR) in some variables. X1 has 10% missing values, X2 has 5% missing values and X3 has 20% missing values.
numData_latent: A modification of numData where the variable Z is not included and thus can be considered latent.
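
To illustrate how a numeric variable becomes a five-level categorical one, here is a minimal sketch of the binning that the simulation code further down applies with cut(). The vector x holds made-up values chosen for this illustration only; it is not taken from numData.

```r
# Toy numeric vector (hypothetical values, not taken from numData)
x <- c(0.5, 1.2, 2.8, 3.1, 4.9, 6.0, 7.7, 8.4, 9.9, 10.0)

# Split the observed range of x into 5 equal-width intervals labelled a to e
x_cat <- cut(x, breaks = 5, labels = letters[1:5])

levels(x_cat)  # the five possible categories
table(x_cat)   # number of observations falling in each category
```

Because the intervals have equal width rather than equal counts, the categories are generally not equally frequent.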

The code for generating the datasets is included below. Please note that the datasets do not necessarily live up to all assumptions of the procedures presented here and are only intended as example data for illustrating the syntax of the functions, not for measuring performance.

The causal model for the first four datasets is represented in Figure 1 below.

Figure 1: True DAG for the datasets numData, catData, mixData and catData_mcar.
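
For readers who want the graph in machine-readable form, the sketch below writes the DAG as an adjacency matrix in base R. The edges are read off the structural equations in the simulation code further down, not copied from the figure. The name true_amat and the convention that entry [from, to] = 1 means "from -> to" are assumptions made for this illustration; individual causal discovery packages may expect the transpose or a different coding.

```r
vars <- c("X1", "X2", "X3", "Z", "Y")

# Empty 5 x 5 matrix; rows are parents ("from"), columns are children ("to")
true_amat <- matrix(0L, nrow = 5, ncol = 5,
                    dimnames = list(from = vars, to = vars))

# X1 is generated from Z
true_amat["Z", "X1"] <- 1L
# X2 is generated from X3
true_amat["X3", "X2"] <- 1L
# Y is generated from X1, X2, X3 and Z
true_amat[c("X1", "X2", "X3", "Z"), "Y"] <- 1L

true_amat
```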

In the last dataset, numData_latent, the variable Z is unobserved, and the causal model can therefore be represented by a MAG obtained by replacing Z with a double-headed arrow from X1 to Y. We also provide this graph below in Figure 2.

Figure 2: True MAG for the dataset numData_latent.

The simulated datasets were generated in R using the following code:

############################################################################################
######################Simulate data to be used for causalDisco code examples################
############################################################################################
#All datasets use the numerical data as their starting point. This dataset is simulated as 
#a mix of linear and non-linear structural equations with additive noise. The noise 
#components are Gaussian or uniform. 
#The datasets are stored in data.frames and converted to other formats (e.g. matrix) 
#in the code examples, if needed.

#sample size
n <- 1000 

#Simulate numData
set.seed(123)
numData <- data.frame(Z = abs(rnorm(n, mean = 10))) 
numData$X1 <- sqrt(numData$Z) + runif(n, min = 0, max = 2)
numData$X3 <- runif(n, min = 5, max = 10)
numData$X2 <- 2*numData$X3 - rnorm(n, mean = 5) 
numData$Y <- numData$X1^2 + numData$X2 - numData$X3 - numData$Z + rnorm(n, mean = 10)
numData <- numData[, c("X1", "X2", "X3", "Z", "Y")]

#Make catData
#lapply() keeps every column a factor; sapply() would collapse them to a character matrix
catData <- as.data.frame(lapply(numData, function(x) cut(x, breaks = 5,
                                                         labels = letters[1:5])))

#Make mixData
mixData <- numData
mixData$X1 <- catData$X1
mixData$X2 <- catData$X2
mixData$X3 <- catData$X3

#Make catData_mcar
catData_mcar <- catData
set.seed(1234)
catData_mcar$X1[sample(1:n, 100)] <- NA
catData_mcar$X2[sample(1:n, 50)] <- NA
catData_mcar$X3[sample(1:n, 200)] <- NA

#Make numData_latent
#Drop Z so that it acts as a latent variable
numData_latent <- numData[, c("X1", "X2", "X3", "Y")]
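
The MCAR step used for catData_mcar can be wrapped in a small function and checked afterwards. The following is a minimal sketch on toy data: the helper add_mcar(), the data frame toy and the seed are hypothetical and introduced only for this illustration; they are not part of the code used to generate the example datasets.

```r
# Hypothetical helper: set a fixed share of randomly chosen entries to NA
add_mcar <- function(x, prop) {
  n_miss <- round(prop * length(x))
  x[sample(seq_along(x), n_miss)] <- NA
  x
}

set.seed(1)  # arbitrary seed, used for this illustration only

# Toy categorical data with the same five levels as catData
toy <- data.frame(
  A = factor(sample(letters[1:5], 1000, replace = TRUE)),
  B = factor(sample(letters[1:5], 1000, replace = TRUE))
)

toy$A <- add_mcar(toy$A, prop = 0.10)
toy$B <- add_mcar(toy$B, prop = 0.20)

# Share of missing values in each column
colMeans(is.na(toy))
```

The same final line, applied to catData_mcar, gives a quick check that the missingness shares match the description of the dataset.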

Welcome to the causalDisco web tool!

This tool provides an overview of available causal discovery procedures in R for a given causal discovery problem, along with code examples and links to where you can learn more about each procedure. I hope you find the tool useful!

Using the causalDisco tool

To use the tool, first look at the panel to the left and check off the properties that you need for your causal discovery problem. causalDisco will then provide a list of available R procedures that may be applicable to your problem. If you click on one of them, you will find more information about the procedure, including a working code example. All of the code examples are run on simulated example data. You can learn more about the simulated data by clicking the "About the data" pane above.

Contact information

The causalDisco web tool was developed by Anne Helby Petersen, PhD student at the University of Copenhagen. If you have questions, comments, ideas for improvements or a bug to report, please open an issue on GitHub or contact me at ahpe [at] sund [dot] ku [dot] dk.