Project 1#
Randomized Experiments
TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators' names here.
In this project, we will conduct our first causal study start to finish by examining the Infant Health and Development Program (IHDP) dataset, which is a classic dataset in the causal inference literature.
Learning Objectives#
Practice following the causal roadmap with a real-world dataset.
Explore the benefits of stratification in a randomized experiment.
Translate mathematical quantities into code by implementing two difference-in-means estimators.
Exercise your data manipulation and visualization skills with real and simulated data.
Note
This project is due Monday 3/3 at 11:55pm.
Table of Contents and Rubric#
| Section | Points |
|---|---|
| 1. Difference-in-means | 2 |
| 2. Exploring the efficiency of difference-in-means estimators | 2 |
| 3. The Infant Health and Development Program | 2.5 |
| 4. Bootstrap utility functions | 0.5 |
| 5. Estimation and Interpretation | 2.5 |
| 6. Reflection | 0.5 |
| Total | 10 pts |
Grading guidelines
The course projects offer an opportunity to practice the full causal inference workflow, from building estimators and formulating questions to conducting analyses and communicating your findings effectively. Here are some guidelines on how to approach the project:
Like the worksheets, a portion of points will be autograded, so feel free to submit as many times as you want to check your code's correctness!
For visualizations:
Help your reader understand your findings through visually clear figures
Label your axes, provide legends when appropriate, and add figure titles to provide context
For written responses:
Support your ideas with specific evidence from your analysis or prior knowledge
Write concisely but clearly; one-word or one-phrase answers usually don't give enough space to show what you've learned
If you're uncertain about any portion of the project, please do come to office hours, TA hours, or reach out on Ed!
How to submit#
Like our worksheets, follow the instructions on the course website to submit your completed proj1.ipynb and proj1.py files, along with your causal graph image for the Infant Health and Development Program analysis.
Notebook imports#
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import interact, interact_manual, fixed
1. Difference-in-means#
We begin by simulating a randomized experiment, where there are the following variables:
\(T\): the binary intervention of interest
\(Y\): the outcome of interest
\(X\): a binary variable that affects the outcome of interest, which we will refer to as a covariate
We provide the following code for simulating a random experiment with a binary treatment \(T\) and binary covariate \(X\):
rng = np.random.default_rng(seed=42)
def sim_random_exp(n_samples=1000, treatment_effect=1.0, covariate_effect=3.0):
"""
Simulate a random experiment with a binary treatment and covariate.
Args:
n_samples (int): the number of samples to simulate
treatment_effect (float): the magnitude of the effect of the treatment on the outcome
covariate_effect (float): the magnitude of the effect of the covariate on the outcome
Returns:
Y (np.ndarray): the observed outcome
T (np.ndarray): the binary treatment assignment
X (np.ndarray): the binary covariate
"""
# Generate potential outcomes
Y0 = rng.normal(size=n_samples)
Y1 = Y0 + treatment_effect
# Randomly assign treatment
T = rng.choice([0, 1], size=n_samples)
# Create a binary covariate that affects the outcome
X = rng.choice([0, 1], size=n_samples)
# Generate the observed outcome
Y = np.where(T == 1, Y1, Y0) + covariate_effect*X
return Y, T, X
Note
The parameter covariate_effect is the magnitude of the effect of the covariate on the outcome: the larger the value, the stronger the effect of the covariate on the outcome.
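For example, a quick usage sketch of the simulator (the argument values here are arbitrary):

```python
# Draw one simulated experiment with a stronger covariate effect
Y, T, X = sim_random_exp(n_samples=500, covariate_effect=5.0)
print(Y.shape, T.mean(), X.mean())  # 500 outcomes, roughly half treated, roughly half X=1
```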
Under a randomized experiment study design, we are able to identify the average treatment effect (\(ATE\)):

$$ATE = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0]$$

We can then estimate the causal effect by taking the difference in means between the treated and control groups, which is known as the difference-in-means estimator:

$$\widehat{ATE} = \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{I}(T_i = 1) Y_i - \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{I}(T_i = 0) Y_i$$

where \(n_1\) is the number of samples in the treatment group, and \(n_0\) is the number of samples in the control group. That is:

$$n_1 = \sum_{i=1}^{n} \mathbb{I}(T_i = 1), \qquad n_0 = \sum_{i=1}^{n} \mathbb{I}(T_i = 0)$$
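In numpy, this estimator is simply a difference of two subgroup means. As a minimal sketch on toy arrays (illustrative only, not the graded implementation below):

```python
toy_Y = np.array([3.0, 1.0, 4.0, 2.0])
toy_T = np.array([1, 0, 1, 0])

# Mean outcome among the treated minus mean outcome among the controls
toy_ate = toy_Y[toy_T == 1].mean() - toy_Y[toy_T == 0].mean()
print(toy_ate)  # (3.0 + 4.0)/2 - (1.0 + 2.0)/2 = 2.0
```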
1.1. Implement diff_in_means#
def diff_in_means(Y, T):
"""
Computes the difference in means between the treatment and control groups.
Args:
Y (np.ndarray or pd.Series): the observed outcome
T (np.ndarray or pd.Series): the binary treatment assignment
Returns:
float: the difference in means estimate
"""
assert Y.shape == T.shape, "Y and T must have the same shape"
# TODO your code here
return 0
test_Y = np.array([2, 1, 2, 1])
test_T = np.array([1, 0, 1, 0])
assert np.isclose(diff_in_means(test_Y, test_T), 1)
# Feel free to add more tests
1.2. Implement diff_in_means_stratified#
We saw in class that if we want to include a covariate in our analysis, we can analogously compute the difference in means between the treated and control groups, but now conditioning on the covariate.
We implement this here as a stratified difference-in-means estimator, which we call \(\widehat{ATE}_\text{stratified}\):

$$\widehat{ATE}_\text{stratified} = \sum_{k=1}^{K} \frac{n_k}{n} \widehat{ATE}_k$$

where \(K\) is the total number of strata, \(n_k\) is the number of samples in stratum \(k\), and \(\widehat{ATE}_k\) is the difference-in-means estimator for stratum \(k\). In other words, we compute the difference in means within each stratum and then take a weighted average of the stratum-level estimates, where the weights are the proportions of samples in each stratum. \(\widehat{ATE}_k\) is defined as:

$$\widehat{ATE}_k = \frac{1}{n_{k1}} \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 1) Y_i - \frac{1}{n_{k0}} \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 0) Y_i$$

where \(n_{k1}\) is the number of samples in the treatment group for stratum \(k\), and \(n_{k0}\) is the number of samples in the control group for stratum \(k\). That is:

$$n_{k1} = \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 1), \qquad n_{k0} = \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 0)$$
Weāll be implementing this estimator for the case when \(X\) is binary so there are only two strata, \(k \in \{0, 1\}\).
Hint
While these equations may look a bit complicated, notice that each individual summation term in \(\widehat{ATE}_k\) is just a mean of the outcome for a given subset of the data. In particular, in our case where \(X\) is binary:
\(\frac{1}{n_{11}} \sum_{i=1}^{n} \mathbb{I}(X_i = 1 \text{ and } T_i = 1) Y_i\) is the mean of the outcome when \(X=1\) and \(T=1\)
\(\frac{1}{n_{10}} \sum_{i=1}^{n} \mathbb{I}(X_i = 1 \text{ and } T_i = 0) Y_i\) is the mean of the outcome when \(X=1\) and \(T=0\)
and analogously for \(X=0\)
You can then use numpy's boolean indexing to compute the means for each stratum.
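For instance, a sketch of this pattern for the \(X = 1\) stratum (assuming arrays Y, T, X as above; the names are illustrative):

```python
# Boolean masks select the stratum and the treatment groups within it
in_stratum = (X == 1)
n_k = in_stratum.sum()  # number of samples in this stratum, used for weighting
ate_k = Y[in_stratum & (T == 1)].mean() - Y[in_stratum & (T == 0)].mean()
```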
def diff_in_means_stratified(Y, T, X):
"""
Compute the difference in means stratified by binary covariate X.
NOTE: stratified difference in means may be undefined if strata are empty.
Args:
Y (np.ndarray or pd.Series): the observed outcome
T (np.ndarray or pd.Series): the binary treatment assignment
X (np.ndarray or pd.Series): the binary covariate
Returns:
float: the difference in means estimate
"""
# TODO your code here
return 0
test_Y = np.array([2, 1, 2, 1])
test_T = np.array([1, 0, 1, 0])
test_X = np.array([1, 1, 0, 0])
assert np.isclose(diff_in_means_stratified(test_Y, test_T, test_X), 1)
# Feel free to add more tests
2. Exploring the efficiency of difference-in-means estimators#
2.1 Interactive widget#
Next, complete the gen_experiment_results function to simulate n_experiments random experiments and compute the difference-in-means and stratified difference-in-means estimates.
Tip
The function declaration below uses the **kwargs syntax to pass additional keyword arguments to the sim_random_exp function. This is a useful way to write functions that are flexible and can take in different sets of arguments. For more information on the **kwargs syntax, see this tutorial.
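As a quick illustration of the forwarding pattern (a toy sketch; run_twice is a hypothetical name, not part of the project):

```python
def run_twice(**kwargs):
    """Forwards any keyword arguments straight through to sim_random_exp."""
    return [sim_random_exp(**kwargs) for _ in range(2)]

# Both calls are valid: the keyword arguments pass through unchanged
_ = run_twice(n_samples=100)
_ = run_twice(n_samples=100, covariate_effect=0.0)
```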
def gen_experiment_results(n_experiments=10000, **kwargs):
"""
Simulates n_experiments random experiments and computes the difference-in-means
and stratified difference-in-means estimators.
Args:
n_experiments (int): the number of experiments to simulate
**kwargs: additional keyword arguments to pass directly to `sim_random_exp`
Returns:
pd.DataFrame: a dataframe with the results of the experiments in each row
The dataframe should have two columns:
- diff_in_means: the difference in means estimator
- diff_in_means_stratified: the stratified difference in means estimator
"""
experiment_results = {
"diff_in_means": [],
"diff_in_means_stratified": []
}
for i in range(n_experiments):
# TODO pass **kwargs to sim_random_exp to return Y, T, X
# TODO populate the experiment_results dictionary
pass
# Converts dictionary to dataframe with keys as column names
exp_df = pd.DataFrame(experiment_results)
return exp_df
Using the gen_experiment_results function, create a manual interactive widget that plots both the stratified and unstratified difference-in-means estimates for a range of covariate effects. Specifically, you should:

- Add an interact_manual decorator to the plot_experiment_simulations function
- Create a slider for the \(X\) variable effect that can take values between 0 and 6
- Fix the number of experiments to 10000 (see the tutorial reference on ipywidgets from Worksheet 3)
- Call the gen_experiment_results function with the given number of experiments and covariate effect
- Plot a histogram of the difference-in-means estimates, unstratified and stratified, for a given covariate effect: this should be a single plot with two histograms on top of each other, which can be achieved using sns.histplot
- Additionally, plot the true difference in means as a vertical line on the same figure, which can be achieved using plt.axvline; the default treatment effect is 1.0

A toy sketch of the decorator pattern appears after the code cell below.
Note
Remember to follow good figure design practices by adding a legend, title, and axis labels.
# TODO add interact_manual decorator
def plot_experiment_simulations(num_experiments=10000, covariate_effect=3.0):
"""
Plots both the stratified and unstratified difference-in-means estimates.
Args:
num_experiments (int): the number of experiments to simulate
covariate_effect (float): the covariate effect to use in the experiments
Returns:
None, but shows the plot
"""
# TODO your code here
pass
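If the decorator wiring is unclear, here is a toy sketch of the pattern on a throwaway function (demo_plot is a hypothetical name; the slider bounds mirror the spec above, and interact_manual and fixed are already imported in the notebook imports cell):

```python
@interact_manual(covariate_effect=(0.0, 6.0, 0.5), num_experiments=fixed(10000))
def demo_plot(num_experiments=10000, covariate_effect=3.0):
    # A real solution would call gen_experiment_results and plot histograms here
    print(f"Would simulate {num_experiments} experiments with covariate effect {covariate_effect}")
```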
2.2 Interpretation#
Recall our discussions of the variance of causal effect estimators:
Variance: how much do the estimates for a given experiment vary? We can visually inspect this by looking at how "spread out" the histogram is.
In addition to variance, researchers may also be interested in the statistical bias of an estimator:
Bias: how far off is the mean estimate from the true value? We can visually inspect this by looking at where the histogram is "centered." If the histogram is centered around the true value, then the estimator is unbiased; if it is not centered around the true value, the estimator is said to be biased.
Play around with the widget you created above, testing out different covariate effect values.
What do you observe about the bias of the unstratified and stratified estimators? Is either biased?
What do you observe about the variance of the unstratified and stratified estimators? Researchers sometimes describe estimators with lower variance as more efficient. Is there an estimator that appears to be more efficient?
What happens to the relative variance of the two estimators when you set the covariate effect to 0, and why do you think this occurs?
TODO your responses here:
3. The Infant Health and Development Program#
Let's now dive into analyzing a real-world study: the Infant Health and Development Program (IHDP). The IHDP was a randomized experiment conducted in the 1980s to evaluate the effectiveness of an early intervention program for low-birthweight, prematurely born infants.
3.1 Prior Knowledge and Causal Question#
Read the following resources on the IHDP study:
Brooks-Gunn et al. 1991, pages 350-352
Content warning
The reading contains the term "mental retardation," which was commonly used in medical and policy discussions at the time but is now outdated and harmful. Today, the term "intellectual disability" is preferred and is used medically and federally.
Reading Notes
The Brooks-Gunn et al. 1991 paper contains minute details about the study recruitment and data collection process. Don't worry about understanding every aspect of the study, but try to get a sense of the overall experimental design and the baseline characteristics reported in Table 1.
The original methods mention stratification by both birthweight and clinical site in the study design. Because the probability of treatment is the same for all stratified groups, both the stratified difference-in-means and unstratified difference-in-means are valid estimators. For the purposes of this project, we will analyze the efficiency of estimators that post-stratify by other covariates.
Answer the following questions:
The outcome of interest we will be analyzing is the Stanford-Binet Intelligence Scale score (IQ) at an age of 36 months (3 years), and our causal quantity is the average treatment effect (ATE) on the childās IQ score. Describe what causal question we are trying to answer in this dataset in terms of the causal quantity we are trying to identify.
State what \(Y(1)\) and \(Y(0)\) are in your own words given the treatment and outcome of the study.
TODO your responses here:
3.2 Study Design#
Next, letās load the IHDP dataset and look at the data:
ihdp_raw = pd.read_csv('~/COMSC-341CD/data/ihdp_real.csv')
ihdp_raw.info()
There are 50 total columns in the dataframe, but we will only use the following columns for the project:

- bw: birth weight in grams
- momage: mother's age in years
- nnhealth: neonatal health index, higher scores indicate better health
- momwhite: binary indicator for maternal ethnicity: 1 if white, 0 otherwise
- iqsb.36: outcome: Stanford-Binet IQ score at 36 months
- treat: binary treatment assignment
Our study design is a randomized experiment, and we will assume that all of the assumptions that are needed to identify the average treatment effect hold. Answer the following questions:
Suppose that our data did not come from a randomized experiment, and families were able to choose whether or not to participate in the intervention. Select two of the four variables listed above that you think may confound the relationship between the treatment and outcome, and explain why.
Your response: TODO
Given the two variables you selected in the previous question, treatment, and outcome, draw a DAG with these four variables if this data were not collected from a randomized experiment. Give each node a variable name (e.g. \(T\), \(Y\), \(C_1\), \(C_2\)) and indicate what each variable represents.
TODO draw your DAG
Draw the modified DAG from the previous question under the scenario where the treatment is randomized.
TODO draw your DAG
Submission note
For this question, you can either hand-draw the graphs on paper and submit a photo or use an online tool of your choice. Please make sure the image is submitted as a separate file on Gradescope.
3.3 Visualizing and cleaning the data#
Let's next ensure that the baseline characteristics are balanced between the treatment and control groups. Complete the baseline_stats function below to verify that the bw, momage, nnhealth, and momwhite characteristics are balanced and roughly match the values reported in Table 1 of Brooks-Gunn et al. 1991.
Tip

- It is possible to complete this operation in one line by using pandas groupby() + agg()
- Since momwhite is coded as a binary variable, the mean will be the proportion of momwhite = 1 in the sample.
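For reference, the general pattern looks like this on a made-up dataframe (a sketch, not the graded answer):

```python
toy = pd.DataFrame({
    "treat": [0, 0, 1, 1],
    "bw":    [1500, 1800, 1600, 1700],
})
# Mean and standard deviation of a column within each treatment group
print(toy.groupby("treat")["bw"].agg(["mean", "std"]))
```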
def baseline_stats(df, covariate):
"""
Computes the mean and standard deviation of a covariate
for the treatment and control groups.
Assumes that the treatment indicator column is named `treat`.
Args:
df (pd.DataFrame): the dataframe to compute the baseline stats
covariate (str): the name of the covariate to compute the baseline stats
Returns:
pd.DataFrame: a grouped dataframe with the mean and standard deviation
of the covariate for the treatment and control groups
"""
# TODO your code here
pass
# display(baseline_stats(ihdp_raw, 'bw'))
# display(baseline_stats(ihdp_raw, 'momage'))
# display(baseline_stats(ihdp_raw, 'nnhealth'))
# display(baseline_stats(ihdp_raw, 'momwhite'))
Beyond having similar mean and standard deviation, an implication of exchangeability is that the treatment and control groups should have a similar distribution of covariates. We can inspect this by plotting the (cumulative) distribution of the covariates for the treatment and control groups and visually inspecting whether the two distributions are similar.
Let's verify this is the case for our continuous variables bw, momage, and nnhealth, using a distribution plot via sns.kdeplot. Specifically, complete the plot_covariate_distribution function below by generating a kde plot with:

- the covariate on the x-axis
- treat as the hue
- ihdp_df as the data
- common_norm=False to ensure that the y-axis is on the same scale for both the treatment and control groups
- optional: cumulative=True to plot the cumulative distribution, to make the curves easier to compare
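Assembling those arguments, the core call might look like this sketch (shown here for bw on the raw data):

```python
# One KDE curve per treatment group, normalized separately within each group
sns.kdeplot(data=ihdp_raw, x="bw", hue="treat", common_norm=False, cumulative=True)
plt.title("Cumulative distribution of birth weight by treatment group")
plt.show()
```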
Note
For the statistically curious: a natural hypothesis to test here is whether the two datasets come from the same distribution, which can be done using a two-sample Kolmogorov-Smirnov test.
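If you want to try it, scipy provides this test; a sketch on the bw covariate (optional, not graded):

```python
from scipy import stats

treated_bw = ihdp_raw.loc[ihdp_raw["treat"] == 1, "bw"]
control_bw = ihdp_raw.loc[ihdp_raw["treat"] == 0, "bw"]

# Large p-values are consistent with the two groups sharing a distribution
print(stats.ks_2samp(treated_bw, control_bw))
```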
def plot_covariate_distribution(df, covariate, title):
"""
Plots the (optionally cumulative) distribution of a covariate for the treatment and control groups.
Args:
df (pd.DataFrame): the dataframe to plot the covariate distribution
covariate (str): the name of the covariate to plot
title (str): the title of the plot
Returns:
None, but shows the covariate distribution plot
"""
# TODO your code here
plt.show()
#plot_covariate_distribution(ihdp_raw, 'bw', 'TODO title')
#plot_covariate_distribution(ihdp_raw, 'momage', 'TODO title')
#plot_covariate_distribution(ihdp_raw, 'nnhealth', 'TODO title')
Our outcome is iqsb.36, which is the infant's IQ score at 36 months. However, there are some missing values in the iqsb.36 column, so we need to drop the rows that do not have the outcome.
Complete the load_and_clean_ihdp_data function below, and then run the cell to load the cleaned dataset.
def load_and_clean_ihdp_data():
"""
Loads the IHDP dataset, selecting the columns of interest and dropping the
rows that do not have the outcome.
Returns:
pd.DataFrame: the cleaned IHDP dataset with the following columns selected:
bw: birth weight in grams
momage: mother's age in years
iqsb.36: IQ score at 36 months
nnhealth: neonatal health index
momwhite: binary indicator for maternal ethnicity: 1 if white, 0 otherwise
treat: binary treatment assignment
"""
ihdp_df = pd.DataFrame()
# TODO your code here
assert ihdp_df.shape[0] == ihdp_df['iqsb.36'].notna().sum()
return ihdp_df
ihdp_clean = load_and_clean_ihdp_data()
4. Bootstrap utility functions#
Since we only have one real dataset, we cannot generate a distribution of the difference-in-means estimator over multiple experiments. However, we can use bootstrapping to generate a distribution of the difference-in-means estimator.
As we saw in worksheet 3 and in class, bootstrapping is a powerful technique that allows us to generate a sampling distribution over any estimator by resampling the data with replacement. Copy your bootstrap_dfs function from Worksheet 3 below, which takes in a dataframe and returns a list of bootstrapped dataframes.
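If you no longer have your Worksheet 3 solution handy, recall that the core resampling step is pandas' sample with replacement; a minimal sketch of a single bootstrap draw:

```python
# One bootstrap replicate: resample all rows with replacement
boot_df = ihdp_clean.sample(n=len(ihdp_clean), replace=True)
```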
def bootstrap_dfs(df, n_bootstraps=5000):
"""
Bootstraps the dataframe `n_bootstraps` times.
Args:
df (pd.DataFrame): the dataframe to bootstrap
n_bootstraps (int): the number of bootstraps to generate
Returns:
list[pd.DataFrame]: a list of bootstrapped dataframes
"""
# TODO your code here
pass
Next, weāll write a utility function to compute the bootstrapped confidence interval given a list of bootstrap values, similar to how we did in Worksheet 3.
Note
This function takes in an alpha level of 0.05 by default, which corresponds to a 95% confidence interval. However, this implementation allows for any alpha level to be used, e.g. for 90% or 99% confidence intervals.
Also note that while the np.percentile function takes in a percentage between 0 and 100, the alpha parameter is a proportion between 0 and 1. Be sure to convert to the correct units!
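Concretely, the unit conversion looks like this (a sketch on toy values):

```python
toy_values = rng.normal(size=1000)
alpha = 0.05
# np.percentile expects percentages: alpha/2 -> 2.5, 1 - alpha/2 -> 97.5
lower, upper = np.percentile(toy_values, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(lower, upper)
```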
def bootstrap_ci(bootstrap_values, alpha=0.05):
"""
Computes the confidence interval using the percentile method.
Args:
bootstrap_values (list[float]): the bootstrapped values
alpha (float): the significance level, defaults to 0.05
Returns:
list[float]: the confidence interval [lower, upper]
"""
# TODO your code here
pass
5. Estimation and Interpretation#
5.1 Difference-in-means estimation#
We'll now want to apply our difference-in-means estimators to the IHDP dataset. In order to use our stratified difference-in-means estimator, we need to binarize the continuous covariates we are analyzing. Complete the following binarize_covariate function, which creates a new binary covariate that is 1 if the covariate is greater than the cutpoint, and 0 otherwise.

Then, apply the binarize_covariate function to the momage and nnhealth covariates:

- using a cutpoint of 22 years for momage
- using a cutpoint of 100 for nnhealth
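The comparison itself is a one-liner in pandas; a toy sketch of the idea (illustrative, not the graded function):

```python
toy = pd.DataFrame({"momage": [18, 22, 25, 30]})
# Strictly greater than the cutpoint maps to 1, everything else to 0
toy["momage_bin"] = (toy["momage"] > 22).astype(int)
print(toy)  # momage_bin: 0, 0, 1, 1
```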
def binarize_covariate(df, covariate, cutpoint):
"""
Creates a new binary covariate that is 1 if the covariate is greater
than the cutpoint, and 0 otherwise.
Updates the dataframe with a new column named `{covariate}_bin`
Args:
df (pd.DataFrame): the dataframe to update
covariate (str): the name of the covariate to binarize
cutpoint (float): the cutpoint to use for binarization
Returns:
pd.DataFrame: the updated dataframe with the new binary covariate
"""
# TODO your code here
return df
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
assert binarize_covariate(df, 'A', 2.5).equals(pd.DataFrame({'A': [1, 2, 3, 4, 5], 'A_bin': [0, 0, 1, 1, 1]}))
# TODO call binarize_covariate on the `momage` and `nnhealth` covariates
Finally, let's compute our results. For the following difference-in-means estimators:

- unstratified difference-in-means estimator for the iqsb.36 outcome
- stratified difference-in-means estimator for the iqsb.36 outcome, stratified by the binarized momage
- stratified difference-in-means estimator for the iqsb.36 outcome, stratified by the binarized nnhealth
- stratified difference-in-means estimator for the iqsb.36 outcome, stratified by momwhite
Compute and report the following in the markdown table below (up to two decimal places):
the estimate for the difference-in-means estimator, using the cleaned dataset (not a bootstrap sample).
the bootstrapped 95% confidence interval (CI) for the difference-in-means estimator
the length of the confidence interval, which is the difference between the upper and lower bounds of the confidence interval
Tip
See the comments in the code cell below for guidance on how you might want to structure your analysis, and feel free to create helper functions to organize your code. You do not have to follow the comments exactly, they are just a suggestion!
# TODO get point estimates using the cleaned dataset
point_estimates = {
"unstratified": 0,
"stratified_nnhealth": 0,
"stratified_momage": 0,
"stratified_momwhite": 0
}
# TODO generate bootstrap dataframes
# TODO create a data structure to store difference-in-means results
# TODO compute difference-in-means for each bootstrapped dataframe
# TODO report the results for each estimator
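One possible way to organize the bootstrap loop is a small helper that applies an estimator function to every bootstrapped dataframe (a hypothetical structure, not a requirement):

```python
# Hypothetical helper: collect one estimate per bootstrapped dataframe
def bootstrap_estimates(boot_dfs, estimator_fn):
    """estimator_fn takes a dataframe and returns a float estimate."""
    return [estimator_fn(df) for df in boot_dfs]

# Usage sketch:
# boot_dfs = bootstrap_dfs(ihdp_clean)
# unstrat = bootstrap_estimates(boot_dfs, lambda df: diff_in_means(df["iqsb.36"], df["treat"]))
# unstrat_ci = bootstrap_ci(unstrat)
```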
| Estimator | Effect | 95% CI | Length of CI |
|---|---|---|---|
| Difference-in-means | TODO | TODO | TODO |
| Difference-in-means stratified by momage | TODO | TODO | TODO |
| Difference-in-means stratified by nnhealth | TODO | TODO | TODO |
| Difference-in-means stratified by momwhite | TODO | TODO | TODO |
5.2 Interpretation#
Finally, let's interpret the results.
Do any of the estimators appear to disagree with the others in terms of the estimate? Which estimator produces the shortest confidence interval?
Based on your analysis, what conclusions can you draw about the average treatment effect of the intervention? Does it improve or worsen the outcome?
Discuss any ethical/design considerations of the experimental setup you noticed (~1-2 brief paragraphs). You can refer back to our discussion of these challenges from class and the study design described in the reading. Some questions you could consider:
At the time the study was designed, do you think there was genuine uncertainty about the effectiveness of the intervention? Why might researchers have considered the intervention promising but unproven?
What do you notice about how infants were recruited into the study? Are they representative of the overall U.S. population of underweight newborns?
Is the selected outcome appropriate for evaluating the intervention's impact? What are some other outcomes that could be used to evaluate the intervention?
TODO your responses here:
6. Reflection#
How much time did you spend on this assignment?
Were there any parts of the assignment that you found particularly challenging?
What is one thing you have a better understanding of after completing this assignment and going through the class content?
Do you have any follow-up questions about concepts that youād like to explore further?
TODO your responses here:
7. Optional extensions#
If you'd like, here are some optional extensions you can explore and implement:

- Our stratified estimator currently only works for stratifying on binary variables. A natural extension would be to modify our diff_in_means_stratified function to work for multiple strata.
- Once our stratified difference-in-means estimator is extended to support multiple strata, we can explore the trade-off between using more or fewer strata for continuous variables, and see how this affects the variance of our estimator. This can be done by using the pd.cut function to bin the covariate into a discrete number of bins (see the sketch after this list).
- Another benefit of randomized experiments is that we can use them to study multiple outcomes. Brooks-Gunn et al. 1991 mention that the IQ score only provides a global estimate of cognitive function, and so they also conduct an analysis of the Peabody Picture Vocabulary Test (PPVT), which assesses vocabulary. We can follow the causal roadmap again to analyze this outcome, represented as the ppvt.imp column in the IHDP dataset.
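For the multi-strata extension, pd.cut handles the binning; a sketch (the bin count here is arbitrary):

```python
# Bin a continuous covariate into 4 equal-width strata labeled 0..3
ihdp_clean["momage_strata"] = pd.cut(ihdp_clean["momage"], bins=4, labels=False)
print(ihdp_clean["momage_strata"].value_counts())
```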