Project 1 Part 2: Analysis#

Randomized Experiments

—TODO your name here

Collaboration Statement

  • TODO brief statement on the nature of your collaboration.

  • TODO your collaborators' names here

Part 2 Table of Contents and Rubric#

Note

Grading guidelines

The course projects offer an opportunity to practice the full causal inference workflow, from building estimators and formulating questions to conducting analyses and communicating your findings effectively. Here are some guidelines on how to approach the project:

  • For visualizations:

    • Help your reader understand your findings through visually clear figures

    • Label your axes, provide legends when appropriate, and add figure titles to provide context

  • For written responses:

    • Support your ideas with specific evidence from your analysis or prior knowledge

    • Write concisely but clearly – one-word/one-phrase answers usually don’t give enough space to show what you’ve learned

If you’re uncertain about any portion of the project, please come to office hours or TA hours, or reach out on Ed!

Notebook and function imports#

Tip

If you click on the vertical blue bar on the left of a cell, you can collapse the code, which can help organize the notebook as you work through the project.

If you tested your implementation in Part 1 against the autograder, you should have generated a file called proj1_functions.py. Let’s now import those functions into this notebook for use in Part 2.

If you are running this notebook on the JupyterHub allocated for the course:

  1. Open the file browser by going to the menu bar “View -> File Browser”

  2. Navigate to comsc341cd.github.io/projects/, where you should see your proj1_analysis.ipynb file in that folder

  3. Click on the upload button in the upper right and upload the proj1_functions.py file to this directory

  4. Run the following cell to import the functions

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import the estimator functions you implemented in Part 1
from proj1_functions import *

# fixed seed so that bootstrap resampling is reproducible
rng = np.random.RandomState(42)

4. The Infant Health and Development Program#

Let’s now dive into analyzing a real-world study: the Infant Health and Development Program (IHDP). The IHDP was a randomized experiment conducted in the 1980s to evaluate the effectiveness of an early intervention program for low-birthweight, prematurely born infants.

4.1 Prior Knowledge and Causal Question#

Read the following resources on the IHDP study:

Content warning

The reading contains the term “mental retardation,” which was commonly used in medical and policy discussions at the time but is now outdated and harmful. Today, the term “intellectual disability” is preferred and is used medically and federally.

Reading Notes

  • The Brooks-Gunn et al. 1991 paper contains minute details about the study recruitment and data collection process. Don’t worry about understanding every aspect of the study, but try to get a sense of the overall experimental design and the baseline characteristics reported in Table 1.

  • The original methods mention stratification by both birthweight and clinical site in the study design. Because the probability of treatment is the same for all stratified groups, both the stratified difference-in-means and unstratified difference-in-means are valid estimators. For the purposes of this project, we will analyze the efficiency of estimators that post-stratify by other covariates.

Answer the following questions:

  1. The outcome of interest we will be analyzing is the Stanford-Binet Intelligence Scale score (IQ) at an age of 36 months (3 years), and our causal quantity is the average treatment effect (ATE) on the child’s IQ score. Describe what causal question we are trying to answer in this dataset in terms of the causal quantity we are trying to identify.

  2. State what \(Y(1)\) and \(Y(0)\) are in your own words given the treatment and outcome of the study.

TODO your responses here:

4.2 Study Design#

Next, let’s load the IHDP dataset and look at the data:

ihdp_raw = pd.read_csv('~/COMSC-341CD/data/ihdp_real.csv')
ihdp_raw.info()

There are 50 total columns in the dataframe, but we will only use the following columns for the project:

  • bw: birth weight in grams

  • momage: mother’s age in years

  • nnhealth: neonatal health index, higher scores indicate better health

  • momwhite: binary indicator for maternal ethnicity: 1 if white, 0 otherwise

  • iqsb.36: outcome: Stanford-Binet IQ score at 36 months

  • treat: binary treatment assignment

Our study design is a randomized experiment, and we will assume that all of the assumptions that are needed to identify the average treatment effect hold. Answer the following questions:

  1. Suppose that our data did not come from a randomized experiment, and families were able to choose whether or not to participate in the intervention. Select two of the four covariates listed above (bw, momage, nnhealth, momwhite) that you think may confound the relationship between the treatment and outcome, and explain why.

Your response: TODO

  2. Given the two variables you selected in the previous question, together with the treatment and outcome, draw a DAG over these four variables as if the data were not collected from a randomized experiment. Give each node a variable name (e.g. \(T\), \(Y\), \(C_1\), \(C_2\)) and indicate what each variable represents.

TODO draw your DAG

  3. Draw the modified DAG from the previous question under the scenario where the treatment is randomized.

TODO draw your DAG

Submission note

For this question, you can either hand-draw the graphs on paper and submit a photo or use an online tool of your choice. Please make sure the image is submitted as a separate file on Gradescope.

4.3 Visualizing and cleaning the data#

Let’s next ensure that the baseline characteristics are balanced between the treatment and control groups. Complete the baseline_stats function below to verify that the bw, momage, nnhealth, and momwhite characteristics are balanced and roughly match the values reported in Table 1 of Brooks-Gunn et al. 1991.

Tip

  • It is possible to complete this operation in one line by using pandas groupby() + agg()

  • Since momwhite is coded as a binary variable, the mean will be the proportion of momwhite = 1 in the sample.

def baseline_stats(df, covariate):
    """
    Computes the mean and standard deviation of a covariate 
    for the treatment and control groups.
    Assumes that the treatment indicator column is named `treat`.

    Args:
        df (pd.DataFrame): the dataframe to compute the baseline stats
        covariate (str): the name of the covariate to compute the baseline stats

    Returns:
        pd.DataFrame: a grouped dataframe with the mean and standard deviation 
            of the covariate for the treatment and control groups
    """
    # TODO your code here
    pass

# display(baseline_stats(ihdp_raw, 'bw'))
# display(baseline_stats(ihdp_raw, 'momage'))
# display(baseline_stats(ihdp_raw, 'nnhealth'))
# display(baseline_stats(ihdp_raw, 'momwhite'))
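
As the tip suggests, this can be a one-liner. A hedged sketch of the groupby() + agg() pattern, shown here for bw (your baseline_stats should generalize it to any covariate):

# mean and standard deviation of `bw` within each treatment arm
print(ihdp_raw.groupby('treat')['bw'].agg(['mean', 'std']))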

Beyond having similar means and standard deviations, an implication of exchangeability is that the treatment and control groups should have similar covariate distributions. We can inspect this by plotting the (cumulative) distribution of each covariate for the treatment and control groups and visually checking whether the two distributions are similar.

Let’s verify this is the case for our continuous variables bw, momage, and nnhealth using a distribution plot via sns.kdeplot. Specifically, complete the plot_covariate_distribution function below by generating a kde plot with:

  • the covariate on the x-axis

  • treat as the hue

  • the passed-in df as the data

  • common_norm=False so that each group’s density is normalized separately, putting the two curves on a comparable scale

  • optional: cumulative=True to plot the cumulative distribution, to make the curves easier to compare

Note

For the statistically curious: a natural hypothesis to test here is whether the two datasets come from the same distribution, which can be done using a two-sample Kolmogorov-Smirnov test.
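
For reference, a minimal sketch of that test using scipy.stats.ks_2samp, assuming scipy is available in your environment:

# two-sample KS test comparing birth weight across the treatment arms
from scipy.stats import ks_2samp

treated_bw = ihdp_raw.loc[ihdp_raw['treat'] == 1, 'bw'].dropna()
control_bw = ihdp_raw.loc[ihdp_raw['treat'] == 0, 'bw'].dropna()
stat, p_value = ks_2samp(treated_bw, control_bw)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")

A large p-value is consistent with (though does not prove) the two groups sharing the same distribution.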

def plot_covariate_distribution(df, covariate, title):
    """
    Plots the (optionally cumulative) distribution of a covariate for the treatment and control groups.

    Args:
        df (pd.DataFrame): the dataframe to plot the covariate distribution
        covariate (str): the name of the covariate to plot
        title (str): the title of the plot
    
    Returns:
        None, but shows the covariate distribution plot
    """
    # TODO your code here
    plt.show()

#plot_covariate_distribution(ihdp_raw, 'bw', 'TODO title')
#plot_covariate_distribution(ihdp_raw, 'momage', 'TODO title')
#plot_covariate_distribution(ihdp_raw, 'nnhealth', 'TODO title')
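
For reference, the call described above might look like the following minimal sketch before being wrapped in plot_covariate_distribution with a title and labels:

# cumulative kde of birth weight, one curve per treatment arm
sns.kdeplot(data=ihdp_raw, x='bw', hue='treat', common_norm=False, cumulative=True)
plt.show()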

Our outcome is iqsb.36, the infant’s Stanford-Binet IQ score at 36 months. However, there are missing values in the iqsb.36 column, so we need to drop the rows that are missing the outcome. Update the ihdp_clean variable below to do so.

# TODO your code here
ihdp_clean = None
assert ihdp_clean.shape[0] == ihdp_raw['iqsb.36'].notna().sum()
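
One minimal way to complete the cell above is dropna() with the subset argument; the copy() is optional, but avoids chained-assignment warnings later on:

# keep only the rows where the outcome iqsb.36 is observed
ihdp_clean = ihdp_raw.dropna(subset=['iqsb.36']).copy()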

5. Estimation and Interpretation#

5.1 Difference-in-means estimation#

We’re now ready to apply our difference-in-means estimators to the IHDP dataset. To use our stratified difference-in-means estimator, we first need to binarize the continuous covariates we are analyzing.

Apply the binarize_covariate function implemented in Part 1 to the momage and nnhealth covariates:

  • using a cutpoint of 22 years for momage

  • using a cutpoint of 100 for nnhealth

# TODO call binarize_covariate on the `momage` and `nnhealth` covariates for ihdp_clean
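
A hedged sketch of these calls, assuming your Part 1 binarize_covariate takes (df, covariate, cutpoint) and returns the dataframe with a new <covariate>_bin column; adjust to match your actual signature:

# assumes the (df, covariate, cutpoint) signature from Part 1
ihdp_clean = binarize_covariate(ihdp_clean, 'momage', 22)
ihdp_clean = binarize_covariate(ihdp_clean, 'nnhealth', 100)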

Finally, let’s compute our results. For the following difference-in-means estimators:

  • unstratified difference-in-means estimator for the iqsb.36 outcome

  • stratified difference-in-means estimator for the iqsb.36 outcome, binarized by momage

  • stratified difference-in-means estimator for the iqsb.36 outcome, binarized by nnhealth

  • stratified difference-in-means estimator for the iqsb.36 outcome, binarized by momwhite

Compute and report the following in the markdown table below (to two decimal places):

  • the estimate for the difference-in-means estimator, using the cleaned dataset (not a bootstrap sample).

  • the bootstrapped 95% confidence interval (CI) for the difference-in-means estimator

  • the length of the confidence interval, which is the difference between the upper and lower bounds of the confidence interval

Tip

See the comments in the code cell below for guidance on how you might structure your analysis, and feel free to create helper functions to organize your code. You do not have to follow the comments exactly; they are just a suggestion!

# TODO get point estimates using the cleaned dataset
point_estimates = {
    "unstratified": 0,
    "stratified_nnhealth": 0,
    "stratified_momage": 0,
    "stratified_momwhite": 0
}

# TODO generate bootstrap dataframes

# TODO create a data structure to store difference-in-means results

# TODO compute difference-in-means for each bootstrapped dataframe

# TODO report the results for each estimator
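
If you would like a starting point for a helper, below is a hedged sketch of a percentile-bootstrap confidence interval. It assumes the estimator (e.g. diff_in_means from Part 1) maps a dataframe with a treat column to a scalar estimate; adjust it to your actual signatures.

# hedged sketch: percentile bootstrap for any dataframe -> scalar estimator
def bootstrap_ci(df, estimator, n_boot=1000, alpha=0.05):
    boot_estimates = []
    for _ in range(n_boot):
        # resample rows with replacement to form one bootstrap sample
        boot_df = df.sample(frac=1, replace=True, random_state=rng)
        boot_estimates.append(estimator(boot_df))
    lower = np.percentile(boot_estimates, 100 * (alpha / 2))
    upper = np.percentile(boot_estimates, 100 * (1 - alpha / 2))
    return lower, upper

# hypothetical usage, assuming diff_in_means(df) from Part 1:
# lo, hi = bootstrap_ci(ihdp_clean, diff_in_means)
# print(f"95% CI: [{lo:.2f}, {hi:.2f}], length: {hi - lo:.2f}")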

| Estimator | Effect | 95% CI | Length of CI |
|---|---|---|---|
| Difference-in-means | TODO | TODO | TODO |
| Difference-in-means stratified by nnhealth_bin | TODO | TODO | TODO |
| Difference-in-means stratified by momage_bin | TODO | TODO | TODO |
| Difference-in-means stratified by momwhite | TODO | TODO | TODO |

5.2 Interpretation#

Finally, let’s interpret the results.

  1. Do any of the estimators appear to disagree with the others in terms of the estimate? Which estimator produces the shortest confidence interval?

  2. Based on your analysis, what conclusions can you draw about the average treatment effect of the intervention? Does it improve or worsen the outcome?

  3. Discuss ethical/design considerations of the experimental setup that you noticed, organized around the three principles of the Belmont Report: one short paragraph (~2-4 sentences) per principle. You can refer back to our discussion of these challenges from class and to the study design described in the reading. Some questions you could (but don’t have to) consider in your responses:

    • At the time the study was designed, do you think there was genuine uncertainty about the effectiveness of the intervention? Why might researchers have considered the intervention promising but unproven?

    • What do you notice about how infants were recruited into the study? Are they representative of the overall U.S. population of underweight newborns?

    • Is the selected outcome appropriate for evaluating the intervention’s impact? What are some other outcomes that could be used to evaluate the intervention?

6. Reflection#

  1. How much time did you spend on this assignment?

  2. Were there any parts of the assignment that you found particularly challenging?

  3. What is one thing you have a better understanding of after completing this assignment and going through the class content?

  4. Do you have any follow-up questions about concepts that you’d like to explore further?

TODO your responses here:

7. Optional extensions#

If you’d like, here are some optional extensions you can explore and implement:

  • Our stratified estimator currently only works for stratifying on binary variables. A natural extension would be to modify our diff_in_means_stratified function to work for multiple strata.

  • Once our stratified difference-in-means estimator is extended to support multiple strata, we can explore the trade-off between using more or fewer strata for continuous variables and see how this affects the variance of our estimator. This can be done by using the pd.cut function to bin the covariate into a discrete number of bins (see the sketch after this list).

  • Another benefit of a randomized experiment is that we can use it to study multiple outcomes. Brooks-Gunn et al. 1991 mention that the IQ score only provides a global estimate of cognitive function, and so they also conduct an analysis of the Peabody Picture Vocabulary Test (PPVT), which assesses vocabulary. We can follow the causal roadmap again to analyze this outcome, represented as the ppvt.imp column in the IHDP dataset.
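
For the pd.cut extension mentioned above, a hedged sketch of the binning step (the choice of 4 bins and the momage_strata column name are illustrative, not prescribed):

# bin mother's age into 4 equal-width strata; labels=False yields integer codes
ihdp_clean['momage_strata'] = pd.cut(ihdp_clean['momage'], bins=4, labels=False)
print(ihdp_clean['momage_strata'].value_counts())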