Activity 21: Independence testing for causal discovery
2025-12-08
import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
# high number of samples to reduce sampling noise
n_samples = 30000
Run the cell below to generate the simulated data_df for the purposes of this activity:
# Generate a binary A
A = rng.choice([0, 1], size=n_samples)
# Generate a continuous B
B = rng.normal(loc=np.zeros(n_samples), scale=0.3)
# Generate a binary C: draw a noisy continuous value centered on A + B, then threshold it at its mean
C = rng.normal(loc=A+B, scale=0.3)
C = (C > C.mean()).astype(int)
# add columns to data_df
data_df = pd.DataFrame({'A':A, 'B':B, 'C':C})
For the purposes of this activity, we will assume that a correlation greater than 0.1 in absolute value shows a dependence between the two variables. If the absolute value of the correlation between two variables is less than 0.1, we will consider them independent.
We can compute correlations between any pair of variables by selecting the relevant columns from a dataframe, calling df[['col1', 'col2']].corr(), and reading off the non-diagonal elements of the resulting matrix. Let's use this to test the following independence relationships:
Is A independent of B?
Is A independent of C?
Is B independent of C?
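As a rough sketch of how these checks could be automated, the helper below computes the Pearson correlation for every pair of columns and compares its magnitude to the 0.1 threshold. The function name pairwise_independence, the toy columns X, Y, Z, and the use of the absolute value are assumptions made for illustration; they are not part of the activity's own code, and the toy data is deliberately different from data_df.

```python
import numpy as np
import pandas as pd


def pairwise_independence(df, threshold=0.1):
    """Map each column pair to (correlation, treated_as_independent).

    Hypothetical helper: a pair is treated as independent when the
    magnitude of its Pearson correlation is below `threshold`.
    """
    corr = df.corr()  # Pearson correlation matrix of all columns
    cols = list(df.columns)
    results = {}
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Off-diagonal entry for this pair of columns
            r = float(corr.loc[first, second])
            results[(first, second)] = (r, abs(r) < threshold)
    return results


# Toy data for illustration only (not the activity's data_df):
# Y is built from X, while Z is generated separately from both.
toy_rng = np.random.RandomState(0)
X = toy_rng.normal(size=5000)
Y = X + toy_rng.normal(scale=0.5, size=5000)
Z = toy_rng.normal(size=5000)
toy_df = pd.DataFrame({'X': X, 'Y': Y, 'Z': Z})

for pair, (r, independent) in pairwise_independence(toy_df).items():
    label = 'independent' if independent else 'dependent'
    print(pair, round(r, 3), label)
```

Calling pairwise_independence(data_df) would apply the same check to the three pairs listed above. Note that a small correlation only rules out a linear relationship, which is why the threshold rule is stated as an assumption of this activity.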
Enter which pair(s) of variables are independent in PollEverywhere: pollev.com/tliu