Project 1 Part 1: Functions#
Randomized Experiments
—TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators’ names here.
Tip
You’ll notice that the non-function cells of this notebook are contained in an `if __name__ == "__main__":` block. This tells Python to run the code only when the file is executed directly, not when it is imported as a module, which will make importing your functions easier in Part 2. If you write additional tests or cells that are not functions, make sure to add them within an `if __name__ == "__main__":` block as well.
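As a minimal illustration of this pattern (with a toy `add` function invented just for this example):

```python
def add(a, b):
    """A toy function that can be imported from another script or notebook."""
    return a + b

if __name__ == "__main__":
    # This block runs when the file is executed directly,
    # but is skipped when the file is imported as a module.
    assert add(1, 2) == 3
```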
Part 1 Table of Contents and Rubric#
| Section | Points |
|---|---|
| Math to code: difference-in-means implementation | 2 |
| Interactivity: stratified vs unstratified estimators | 2 |
| Pandas and bootstrap utilities | 1 |
| **Total** | **5 pts** |
Notebook imports#
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, fixed, FloatSlider
1. Difference-in-means#
We begin by simulating a randomized experiment, where there are the following variables:
\(T\): the binary intervention of interest
\(Y\): the outcome of interest
\(X\): a binary variable that affects the outcome of interest, which we will refer to as a covariate
We provide the following code for simulating a random experiment with a binary treatment \(T\) and binary covariate \(X\):
rng = np.random.default_rng(seed=42)

def sim_random_exp(n_samples=1000, treatment_effect=1.0, covariate_effect=3.0):
    """
    Simulate a random experiment with a binary treatment and covariate.

    Args:
        n_samples (int): the number of samples to simulate
        treatment_effect (float): the magnitude of the effect of the treatment on the outcome
        covariate_effect (float): the magnitude of the effect of the covariate on the outcome

    Returns:
        Y (np.ndarray): the observed outcome
        T (np.ndarray): the binary treatment assignment
        X (np.ndarray): the binary covariate
    """
    # Generate potential outcomes
    Y0 = rng.normal(size=n_samples)
    Y1 = Y0 + treatment_effect

    # Randomly assign treatment
    T = rng.choice([0, 1], size=n_samples)

    # Create a binary covariate that affects the outcome
    X = rng.choice([0, 1], size=n_samples)

    # Generate the observed outcome
    Y = np.where(T == 1, Y1, Y0) + covariate_effect * X

    return Y, T, X
Note
The parameter covariate_effect
is the magnitude of the effect of the covariate on the outcome – the larger the value, the stronger the effect of the covariate on the outcome.
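To make the observed-outcome line concrete, here is a standalone illustration of how `np.where` selects between the two potential outcomes (the arrays are made up for this example, not the simulator’s output):

```python
import numpy as np

# Toy potential outcomes for 4 units (illustrative values only)
Y0 = np.array([0.0, 1.0, 2.0, 3.0])
Y1 = Y0 + 1.0                       # constant treatment effect of 1
T = np.array([1, 0, 1, 0])

# Observed outcome: Y1 where treated, Y0 where control
Y = np.where(T == 1, Y1, Y0)
# Y is [1.0, 1.0, 3.0, 3.0]
```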
Under a randomized experiment study design, we are able to identify the average treatment effect (\(ATE\)):

$$
ATE = \mathbb{E}[Y_1 - Y_0] = \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0]
$$

We can then estimate the causal effect by taking the difference in means between the treated and control groups, which is known as the difference-in-means estimator:

$$
\widehat{ATE} = \frac{1}{n_1} \sum_{i : T_i = 1} Y_i - \frac{1}{n_0} \sum_{i : T_i = 0} Y_i
$$

where \(n_1\) is the number of samples in the treatment group, and \(n_0\) is the number of samples in the control group. That is:

$$
\widehat{ATE} = \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{I}(T_i = 1)\, Y_i - \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{I}(T_i = 0)\, Y_i
$$
1.1. Implement `diff_in_means` [0.5 pts]#
def diff_in_means(Y, T):
    """
    Computes the difference in means between the treatment and control groups.

    Args:
        Y (np.ndarray or pd.Series): the observed outcome
        T (np.ndarray or pd.Series): the binary treatment assignment

    Returns:
        float: the difference in means estimate
    """
    assert Y.shape == T.shape, "Y and T must have the same shape"
    # TODO your code here
    return 0

#### test diff_in_means ####
if __name__ == "__main__":
    test_Y = np.array([2, 1, 2, 1])
    test_T = np.array([1, 0, 1, 0])
    assert np.isclose(diff_in_means(test_Y, test_T), 1)
    # Feel free to add more tests
1.2. Implement `diff_in_means_stratified` [1.5 pts]#
We saw in class that if we want to include a covariate in our analysis, we can analogously compute the difference in means between the treated and control groups, but now conditioning on the covariate.
We implement this here as a stratified difference-in-means estimator, which we call \(\widehat{ATE}_\text{stratified}\):

$$
\widehat{ATE}_\text{stratified} = \sum_{k=1}^{K} \frac{n_k}{n} \widehat{ATE}_k
$$

where \(K\) is the total number of strata, \(n_k\) is the number of samples in stratum \(k\), and \(\widehat{ATE}_k\) is the difference-in-means estimator for stratum \(k\). This says that we compute the difference in means within each stratum, and then take a weighted average of the stratum-level estimates, where the weights are the proportion of samples in each stratum. \(\widehat{ATE}_k\) is defined as:

$$
\widehat{ATE}_k = \frac{1}{n_{k1}} \sum_{i : X_i = k,\ T_i = 1} Y_i - \frac{1}{n_{k0}} \sum_{i : X_i = k,\ T_i = 0} Y_i
$$

where \(n_{k1}\) is the number of samples in the treatment group for stratum \(k\), and \(n_{k0}\) is the number of samples in the control group for stratum \(k\). That is:

$$
\widehat{ATE}_k = \frac{1}{n_{k1}} \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 1)\, Y_i - \frac{1}{n_{k0}} \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 0)\, Y_i
$$
We’ll be implementing this estimator for the case when \(X\) is binary so there are only two strata, \(k \in \{0, 1\}\).
Hint
While these equations may look a bit complicated, notice that each individual summation term in \(\widehat{ATE}_k\) is just a mean of the outcome for a given subset of the data. In particular, in our case where \(X\) is binary:
\(\frac{1}{n_{11}} \sum_{i=1}^{n} \mathbb{I}(X_i = 1 \text{ and } T_i = 1) Y_i\) is the mean of the outcome when \(X=1\) and \(T=1\)
\(\frac{1}{n_{10}} \sum_{i=1}^{n} \mathbb{I}(X_i = 1 \text{ and } T_i = 0) Y_i\) is the mean of the outcome when \(X=1\) and \(T=0\)
and analogously for \(X=0\)
You can then use numpy’s boolean indexing to compute the means for each stratum.
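For instance, a compound boolean mask can pick out one stratum-by-treatment subset and average it (the arrays below are made up purely to illustrate the indexing, not the simulated data):

```python
import numpy as np

# Illustrative arrays (not the simulated experiment data)
Y = np.array([5.0, 1.0, 7.0, 2.0])
T = np.array([1, 0, 1, 0])
X = np.array([1, 1, 0, 0])

# Compound boolean mask: units with X == 1 AND T == 1
mask = (X == 1) & (T == 1)

# Mean of the outcome over that subset
stratum_mean = Y[mask].mean()   # only unit 0 qualifies, so the mean is 5.0
```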
def diff_in_means_stratified(Y, T, X):
    """
    Compute the difference in means stratified by binary covariate X.

    NOTE: stratified difference in means may be undefined if strata are empty.

    Args:
        Y (np.ndarray or pd.Series): the observed outcome
        T (np.ndarray or pd.Series): the binary treatment assignment
        X (np.ndarray or pd.Series): the binary covariate

    Returns:
        float: the difference in means estimate
    """
    # TODO your code here
    return 0

#### test diff_in_means_stratified ####
if __name__ == "__main__":
    test_Y = np.array([2, 1, 2, 1])
    test_T = np.array([1, 0, 1, 0])
    test_X = np.array([1, 1, 0, 0])
    assert np.isclose(diff_in_means_stratified(test_Y, test_T, test_X), 1)
    # Feel free to add more tests
2. Exploring the efficiency of difference-in-means estimators#
2.1 Interactive widget#
Next, complete the `gen_experiment_results` function to simulate `n_experiments` random experiments and compute the difference-in-means and stratified difference-in-means estimates.

Tip

The function declaration below uses the `**kwargs` syntax to pass additional keyword arguments to the `sim_random_exp` function. This is a useful way to write flexible functions that can take in different sets of arguments. For more information on the `**kwargs` syntax, see this tutorial.
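As a minimal illustration of the pass-through pattern (with toy functions invented for this example):

```python
def simulate(n_samples=10, effect=1.0):
    """Stand-in for a simulator with several keyword parameters."""
    return n_samples * effect

def run_many(n_runs, **kwargs):
    # **kwargs collects any extra keyword arguments and forwards them unchanged
    return [simulate(**kwargs) for _ in range(n_runs)]

results = run_many(3, n_samples=5, effect=2.0)
# results is [10.0, 10.0, 10.0]
```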
def gen_experiment_results(n_experiments=10000, **kwargs):
    """
    Simulates n_experiments random experiments and computes the difference-in-means
    and stratified difference-in-means estimators.

    Args:
        n_experiments (int): the number of experiments to simulate
        **kwargs: additional keyword arguments to pass directly to `sim_random_exp`

    Returns:
        pd.DataFrame: a dataframe with the results of the experiments in each row.
            The dataframe should have two columns:
            - diff_in_means: the difference in means estimator
            - diff_in_means_stratified: the stratified difference in means estimator
    """
    experiment_results = {
        "diff_in_means": [],
        "diff_in_means_stratified": []
    }
    for i in range(n_experiments):
        # Passes **kwargs through to sim_random_exp to return Y, T, X
        Y, T, X = sim_random_exp(**kwargs)
        # TODO populate the experiment_results dictionary

    # Converts dictionary to dataframe with keys as column names
    exp_df = pd.DataFrame(experiment_results)
    return exp_df

#### test gen_experiment_results ####
if __name__ == "__main__":
    exp_results = gen_experiment_results(n_experiments=200, treatment_effect=2, covariate_effect=0)
    assert exp_results.shape[0] == 200, "Incorrect number of experiments"
    assert "diff_in_means" in exp_results.columns, "diff_in_means column is missing"
    # feel free to add more tests
Using the `gen_experiment_results` function, create a manual interactive widget that plots both the stratified and unstratified difference-in-means for a range of covariate effects. Specifically, you should:

- Add an `interact_manual` decorator to the `plot_experiment_simulations` function
- Create a slider for the \(X\) variable effect that can take values between 0 and 6
- Fix the number of experiments to 10000 (see the tutorial reference on ipywidgets from Worksheet 3)
- Call the `gen_experiment_results` function with the given number of experiments and covariate effect
- Plot a histogram of the difference-in-means, unstratified and stratified, for the given covariate effect: this should be a single plot with two histograms on top of each other, which can be achieved using `sns.histplot`
- Additionally, plot the true difference in means as a vertical line on the same figure, which can be achieved using `plt.axvline`. The default treatment effect is `1.0`

Note

Remember to follow good figure design practices by adding a legend, title, and axis labels.
# TODO add interact_manual decorator
def plot_experiment_simulations(num_experiments=10000, covariate_effect=3.0):
"""
Plots the both the stratified and unstratified difference-in-means
Args:
num_experiments (int): the number of experiments to simulate
covariate_effect (float): the covariate effect to use in the experiments
Returns:
None, but shows the plot
"""
# TODO your code here
#pass
2.2 Interpretation#
Recall our discussions of the variance of causal effect estimators:
Variance: how much do the estimates vary across repeated experiments? We can visually inspect this by looking at how “spread out” the histogram is.
In addition to variance, researchers may also be interested in the statistical bias of an estimator:
Bias: how far is the mean estimate from the true value? We can visually inspect this by looking at where the histogram is “centered.” If the histogram is centered around the true value, the estimator is unbiased; if it is not, the estimator is said to be biased.
Play around with the widget you created above, testing out different covariate effect values.
What do you observe about the bias of the unstratified and stratified estimators? Is either biased?
What do you observe about the variance of the unstratified and stratified estimators? Researchers sometimes describe estimators with lower variance as more efficient. Does one estimator appear to be more efficient?
What happens to the relative variance of the two estimators when you set the covariate effect to 0, and why do you think this occurs?
TODO your responses here:
3. Pandas and bootstrap utilities#
3.1 Bootstrap [0.5 pts]#
Since we only have one real dataset in a causal study, we cannot generate a distribution of the difference-in-means estimator over multiple experiments. However, we can use bootstrapping to generate a distribution of the difference-in-means estimator.
As we saw in Worksheet 3 and in class, bootstrapping is a powerful technique that allows us to generate a sampling distribution over any estimator by resampling the data with replacement. Copy your `bootstrap_dfs` function from Worksheet 3 below, which takes in a dataframe and returns a list of bootstrapped dataframes.
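As a reminder of the core resampling step that `bootstrap_dfs` repeats, a single bootstrap replicate can be drawn with pandas’ `df.sample` (one common approach; your Worksheet 3 implementation may differ):

```python
import pandas as pd

# Tiny illustrative dataframe
df = pd.DataFrame({"Y": [2.0, 1.0, 2.0, 1.0], "T": [1, 0, 1, 0]})

# One bootstrap replicate: draw the same number of rows, with replacement
boot = df.sample(n=len(df), replace=True, random_state=0)

# Same shape as the original, but rows may repeat
assert boot.shape == df.shape
```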
def bootstrap_dfs(df, n_bootstraps=5000):
    """
    Bootstraps the dataframe `n_bootstraps` times.

    Args:
        df (pd.DataFrame): the dataframe to bootstrap
        n_bootstraps (int): the number of bootstraps to generate

    Returns:
        list[pd.DataFrame]: a list of bootstrapped dataframes
    """
    # TODO your code here
    pass
Next, we’ll write a utility function to compute the bootstrapped confidence interval given a list of bootstrap values, similar to how we did in Worksheet 3.
Note
This function takes in an alpha level of 0.05 by default, which corresponds to a 95% confidence interval. However, this implementation allows for any alpha level to be used, e.g. for 90% or 99% confidence intervals.
Also note that while the `np.percentile` function takes in a percentage between 0 and 100, the `alpha` parameter is a proportion between 0 and 1. Be sure to convert to the correct units!
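For example, converting a proportion-scale `alpha` into the percentage units that `np.percentile` expects looks like this (the `values` array is chosen purely for illustration):

```python
import numpy as np

values = np.arange(0, 101)   # 0, 1, ..., 100
alpha = 0.05

# np.percentile expects percentages in [0, 100], so scale the proportions by 100
lower = np.percentile(values, 100 * (alpha / 2))        # 2.5th percentile -> 2.5
upper = np.percentile(values, 100 * (1 - alpha / 2))    # 97.5th percentile -> 97.5
```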
def bootstrap_ci(bootstrap_values, alpha=0.05):
    """
    Computes the confidence interval using the percentile method.

    Args:
        bootstrap_values (list[float]): the bootstrapped values
        alpha (float): the significance level, defaults to 0.05

    Returns:
        list[float]: the confidence interval [lower, upper]
    """
    # TODO your code here
    pass

#### test bootstrap_ci ####
if __name__ == "__main__":
    bootstrap_values = np.arange(0, 101)
    cis = bootstrap_ci(bootstrap_values)
    assert np.isclose(cis[0], 2.5), "Lower bound should be 2.5"
    assert np.isclose(cis[1], 97.5), "Upper bound should be 97.5"
    # feel free to add more tests
3.2 Pandas column processing [0.5 pts]#
In order to facilitate the use of our stratified difference-in-means estimator, we’ll need to be able to binarize a continuous covariate into two strata.
Complete the following `binarize_covariate` function, which creates a new binary covariate that is 1 if the covariate is greater than the cutpoint, and 0 otherwise.
def binarize_covariate(df, covariate, cutpoint):
    """
    Creates a new binary covariate that is 1 if the covariate is greater
    than the cutpoint, and 0 otherwise.
    Updates the dataframe with a new column named `{covariate}_bin`

    Args:
        df (pd.DataFrame): the dataframe to update
        covariate (str): the name of the covariate to binarize
        cutpoint (float): the cutpoint to use for binarization

    Returns:
        pd.DataFrame: the updated dataframe with the new binary covariate
    """
    # TODO your code here
    return df

#### test binarize_covariate ####
if __name__ == "__main__":
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
    expected = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'A_bin': [0, 0, 1, 1, 1]})
    assert binarize_covariate(df, 'A', 2.5).equals(expected), "Binarization is incorrect"
    # feel free to add more tests
How to submit

Like the worksheets, follow the instructions on the course website to submit your work. For Part 1, you will submit `proj1_functions.py` and `proj1_functions.ipynb`.