Project 1 Part 1: Functions#
Randomized Experiments
—TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators’ names here.
Tip
You’ll notice that the non-function cells of this notebook are contained in an `if __name__ == "__main__":` block. This tells Python to run the code only when the file is executed directly, not when it is imported as a module, which will make importing your functions easier in Part 2. If you write additional tests or cells that are not functions, make sure to add them within an `if __name__ == "__main__":` block as well.
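As a minimal illustration of this pattern (with a toy `add` function invented just for this example):

```python
def add(a, b):
    """A toy function that can be imported from another script or notebook."""
    return a + b

if __name__ == "__main__":
    # This block runs when the file is executed directly,
    # but is skipped when the file is imported as a module.
    assert add(1, 2) == 3
```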
Part 1 Table of Contents and Rubric#
| Section | Points |
|---|---|
| Math to code: difference-in-means implementation | 2 |
| Interactivity: stratified vs unstratified estimators | 2 |
| Pandas and bootstrap utilities | 1 |
| **Total** | **5 pts** |
Notebook imports#
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, fixed, FloatSlider
1. Difference-in-means#
We begin by simulating a randomized experiment, where there are the following variables:
\(T\): the binary intervention of interest
\(Y\): the outcome of interest
\(X\): a binary variable that affects the outcome of interest, which we will refer to as a covariate
We provide the following code for simulating a random experiment with a binary treatment \(T\) and binary covariate \(X\):
rng = np.random.default_rng(seed=42)

def sim_random_exp(n_samples=1000, treatment_effect=1.0, covariate_effect=3.0):
    """
    Simulate a random experiment with a binary treatment and covariate.

    Args:
        n_samples (int): the number of samples to simulate
        treatment_effect (float): the magnitude of the effect of the treatment on the outcome
        covariate_effect (float): the magnitude of the effect of the covariate on the outcome

    Returns:
        Y (np.ndarray): the observed outcome
        T (np.ndarray): the binary treatment assignment
        X (np.ndarray): the binary covariate
    """
    # Generate potential outcomes
    Y0 = rng.normal(size=n_samples)
    Y1 = Y0 + treatment_effect

    # Randomly assign treatment
    T = rng.choice([0, 1], size=n_samples)

    # Create a binary covariate that affects the outcome
    X = rng.choice([0, 1], size=n_samples)

    # Generate the observed outcome
    Y = np.where(T == 1, Y1, Y0) + covariate_effect * X

    return Y, T, X
Note
The parameter covariate_effect
is the magnitude of the effect of the covariate on the outcome – the larger the value, the stronger the effect of the covariate on the outcome.
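To make the observed-outcome line concrete, here is a standalone illustration of how `np.where` selects between the two potential outcomes (the arrays are made up for this example, not the simulator’s output):

```python
import numpy as np

# Toy potential outcomes for 4 units (illustrative values only)
Y0 = np.array([0.0, 1.0, 2.0, 3.0])
Y1 = Y0 + 1.0                       # constant treatment effect of 1
T = np.array([1, 0, 1, 0])

# Observed outcome: Y1 where treated, Y0 where control
Y = np.where(T == 1, Y1, Y0)
# Y is [1.0, 1.0, 3.0, 3.0]
```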
Under a randomized experiment study design, we are able to identify the average treatment effect (\(ATE\)):

$$
ATE = \mathbb{E}[Y_1 - Y_0] = \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0]
$$

We can then estimate the causal effect by taking the difference in means between the treated and control groups, which is known as the difference-in-means estimator:

$$
\widehat{ATE} = \frac{1}{n_1} \sum_{i : T_i = 1} Y_i - \frac{1}{n_0} \sum_{i : T_i = 0} Y_i
$$

where \(n_1\) is the number of samples in the treatment group, and \(n_0\) is the number of samples in the control group. That is:

$$
\widehat{ATE} = \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{I}(T_i = 1)\, Y_i - \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{I}(T_i = 0)\, Y_i
$$
1.1. Implement `diff_in_means` [0.5 pts]#
def diff_in_means(Y, T):
    """
    Computes the difference in means between the treatment and control groups.

    Args:
        Y (np.ndarray or pd.Series): the observed outcome
        T (np.ndarray or pd.Series): the binary treatment assignment

    Returns:
        float: the difference in means estimate
    """
    assert Y.shape == T.shape, "Y and T must have the same shape"
    # TODO your code here
    return 0

#### test diff_in_means ####
if __name__ == "__main__":
    test_Y = np.array([2, 1, 2, 1])
    test_T = np.array([1, 0, 1, 0])
    assert np.isclose(diff_in_means(test_Y, test_T), 1)
    # Feel free to add more tests
1.2. Implement `diff_in_means_stratified` [1.5 pts]#
We saw in class that if we want to include a covariate in our analysis, we can analogously compute the difference in means between the treated and control groups, but now conditioning on the covariate.
We implement this here as a stratified difference-in-means estimator, which we call \(\widehat{ATE}_\text{stratified}\):

$$
\widehat{ATE}_\text{stratified} = \sum_{k=1}^{K} \frac{n_k}{n} \widehat{ATE}_k
$$

where \(K\) is the total number of strata, \(n_k\) is the number of samples in stratum \(k\), and \(\widehat{ATE}_k\) is the difference-in-means estimator for stratum \(k\). This says that we compute the difference in means within each stratum, and then take a weighted average of the stratum-level estimates, where the weights are the proportion of samples in each stratum. \(\widehat{ATE}_k\) is defined as:

$$
\widehat{ATE}_k = \frac{1}{n_{k1}} \sum_{i : X_i = k,\ T_i = 1} Y_i - \frac{1}{n_{k0}} \sum_{i : X_i = k,\ T_i = 0} Y_i
$$

where \(n_{k1}\) is the number of samples in the treatment group for stratum \(k\), and \(n_{k0}\) is the number of samples in the control group for stratum \(k\). That is:

$$
\widehat{ATE}_k = \frac{1}{n_{k1}} \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 1)\, Y_i - \frac{1}{n_{k0}} \sum_{i=1}^{n} \mathbb{I}(X_i = k \text{ and } T_i = 0)\, Y_i
$$
We’ll be implementing this estimator for the case when \(X\) is binary so there are only two strata, \(k \in \{0, 1\}\).
Hint
While these equations may look a bit complicated, notice that each individual summation term in \(\widehat{ATE}_k\) is just a mean of the outcome for a given subset of the data. In particular, in our case where \(X\) is binary:
\(\frac{1}{n_{11}} \sum_{i=1}^{n} \mathbb{I}(X_i = 1 \text{ and } T_i = 1) Y_i\) is the mean of the outcome when \(X=1\) and \(T=1\)
\(\frac{1}{n_{10}} \sum_{i=1}^{n} \mathbb{I}(X_i = 1 \text{ and } T_i = 0) Y_i\) is the mean of the outcome when \(X=1\) and \(T=0\)
and analogously for \(X=0\)
You can then use numpy’s boolean indexing to compute the means for each stratum.
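For instance, a compound boolean mask can pick out one stratum-by-treatment subset and average it (the arrays below are made up purely to illustrate the indexing, not the simulated data):

```python
import numpy as np

# Illustrative arrays (not the simulated experiment data)
Y = np.array([5.0, 1.0, 7.0, 2.0])
T = np.array([1, 0, 1, 0])
X = np.array([1, 1, 0, 0])

# Compound boolean mask: units with X == 1 AND T == 1
mask = (X == 1) & (T == 1)

# Mean of the outcome over that subset
stratum_mean = Y[mask].mean()   # only unit 0 qualifies, so the mean is 5.0
```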
def diff_in_means_stratified(Y, T, X):
    """
    Compute the difference in means stratified by binary covariate X.

    NOTE: stratified difference in means may be undefined if strata are empty.

    Args:
        Y (np.ndarray or pd.Series): the observed outcome
        T (np.ndarray or pd.Series): the binary treatment assignment
        X (np.ndarray or pd.Series): the binary covariate

    Returns:
        float: the difference in means estimate
    """
    # TODO your code here
    return 0

#### test diff_in_means_stratified ####
if __name__ == "__main__":
    test_Y = np.array([2, 1, 2, 1])
    test_T = np.array([1, 0, 1, 0])
    test_X = np.array([1, 1, 0, 0])
    assert np.isclose(diff_in_means_stratified(test_Y, test_T, test_X), 1)
    # Feel free to add more tests
2. Exploring the efficiency of difference-in-means estimators#
2.1 Interactive widget#
Next, complete the `gen_experiment_results` function to simulate `n_experiments` random experiments and compute the difference-in-means and stratified difference-in-means estimates.

Tip

The function declaration below uses the `**kwargs` syntax to pass additional keyword arguments to the `sim_random_exp` function. This is a useful way to write flexible functions that can take in different sets of arguments. For more information on the `**kwargs` syntax, see this tutorial.
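As a minimal illustration of the pass-through pattern (with toy functions invented for this example):

```python
def simulate(n_samples=10, effect=1.0):
    """Stand-in for a simulator with several keyword parameters."""
    return n_samples * effect

def run_many(n_runs, **kwargs):
    # **kwargs collects any extra keyword arguments and forwards them unchanged
    return [simulate(**kwargs) for _ in range(n_runs)]

results = run_many(3, n_samples=5, effect=2.0)
# results is [10.0, 10.0, 10.0]
```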
def gen_experiment_results(n_experiments=10000, **kwargs):
    """
    Simulates n_experiments random experiments and computes the difference-in-means
    and stratified difference-in-means estimators.

    Args:
        n_experiments (int): the number of experiments to simulate
        **kwargs: additional keyword arguments to pass directly to `sim_random_exp`

    Returns:
        pd.DataFrame: a dataframe with the results of the experiments in each row.
            The dataframe should have two columns:
            - diff_in_means: the difference in means estimator
            - diff_in_means_stratified: the stratified difference in means estimator
    """
    experiment_results = {
        "diff_in_means": [],
        "diff_in_means_stratified": []
    }
    for i in range(n_experiments):
        # Passes **kwargs through to sim_random_exp to return Y, T, X
        Y, T, X = sim_random_exp(**kwargs)
        # TODO populate the experiment_results dictionary

    # Converts dictionary to dataframe with keys as column names
    exp_df = pd.DataFrame(experiment_results)
    return exp_df

#### test gen_experiment_results ####
if __name__ == "__main__":
    exp_results = gen_experiment_results(n_experiments=200, treatment_effect=2, covariate_effect=0)
    assert exp_results.shape[0] == 200, "Incorrect number of experiments"
    assert "diff_in_means" in exp_results.columns, "diff_in_means column is missing"
    # feel free to add more tests
Using the `gen_experiment_results` function, create a manual interactive widget that plots both the stratified and unstratified difference-in-means for a range of covariate effects. Specifically, you should:

- Add an `interact_manual` decorator to the `plot_experiment_simulations` function
- Create a slider for the \(X\) variable effect that can take values between 0 and 6
- Fix the number of experiments to 10000 (see the tutorial reference on ipywidgets from Worksheet 3)
- Call the `gen_experiment_results` function with the given number of experiments and covariate effect
- Plot a histogram of the difference-in-means, unstratified and stratified, for the given covariate effect: this should be a single plot with two histograms on top of each other, which can be achieved using `sns.histplot`
- Additionally, plot the true difference in means as a vertical line on the same figure, which can be achieved using `plt.axvline`. The default treatment effect is `1.0`

Note

Remember to follow good figure design practices by adding a legend, title, and axis labels.
# TODO add interact_manual decorator
def plot_experiment_simulations(num_experiments=10000, covariate_effect=3.0):
"""
Plots the both the stratified and unstratified difference-in-means
Args:
num_experiments (int): the number of experiments to simulate
covariate_effect (float): the covariate effect to use in the experiments
Returns:
None, but shows the plot
"""
# TODO your code here
#pass
2.2 Interpretation#
Recall our discussions of the variance of causal effect estimators:
Variance: how much do the estimates vary across repeated experiments? We can visually inspect this by looking at how “spread out” the histogram is.
In addition to variance, researchers may also be interested in the statistical bias of an estimator:
Bias: how far is the mean estimate from the true value? We can visually inspect this by looking at where the histogram is “centered.” If the histogram is centered around the true value, the estimator is unbiased; if it is not, the estimator is said to be biased.
Play around with the widget you created above, testing out different covariate effect values.
What do you observe about the bias of the unstratified and stratified estimators? Is either biased?
What do you observe about the variance of the unstratified and stratified estimators? Researchers sometimes describe estimators with lower variance as more efficient. Does one estimator appear to be more efficient?
What happens to the relative variance of the two estimators when you set the covariate effect to 0, and why do you think this occurs?
TODO your responses here:
3. Pandas and bootstrap utilities#
3.1 Bootstrap [0.5 pts]#
Since we only have one real dataset in a causal study, we cannot generate a distribution of the difference-in-means estimator over multiple experiments. However, we can use bootstrapping to generate a distribution of the difference-in-means estimator.
As we saw in Worksheet 3 and in class, bootstrapping is a powerful technique that allows us to generate a sampling distribution over any estimator by resampling the data with replacement. Copy your `bootstrap_dfs` function from Worksheet 3 below, which takes in a dataframe and returns a list of bootstrapped dataframes.
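As a reminder of the core resampling step that `bootstrap_dfs` repeats, a single bootstrap replicate can be drawn with pandas’ `df.sample` (one common approach; your Worksheet 3 implementation may differ):

```python
import pandas as pd

# Tiny illustrative dataframe
df = pd.DataFrame({"Y": [2.0, 1.0, 2.0, 1.0], "T": [1, 0, 1, 0]})

# One bootstrap replicate: draw the same number of rows, with replacement
boot = df.sample(n=len(df), replace=True, random_state=0)

# Same shape as the original, but rows may repeat
assert boot.shape == df.shape
```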
def bootstrap_dfs(df, n_bootstraps=5000):
    """
    Bootstraps the dataframe `n_bootstraps` times.

    Args:
        df (pd.DataFrame): the dataframe to bootstrap
        n_bootstraps (int): the number of bootstraps to generate

    Returns:
        list[pd.DataFrame]: a list of bootstrapped dataframes
    """
    # TODO your code here
    pass
Next, we’ll write a utility function to compute the bootstrapped confidence interval given a list of bootstrap values, similar to how we did in Worksheet 3.
Note
This function takes in an alpha level of 0.05 by default, which corresponds to a 95% confidence interval. However, this implementation allows for any alpha level to be used, e.g. for 90% or 99% confidence intervals.
Also note that while the `np.percentile` function takes in a percentage between 0 and 100, the `alpha` parameter is a proportion between 0 and 1. Be sure to convert to the correct units!
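For example, converting a proportion-scale `alpha` into the percentage units that `np.percentile` expects looks like this (the `values` array is chosen purely for illustration):

```python
import numpy as np

values = np.arange(0, 101)   # 0, 1, ..., 100
alpha = 0.05

# np.percentile expects percentages in [0, 100], so scale the proportions by 100
lower = np.percentile(values, 100 * (alpha / 2))        # 2.5th percentile -> 2.5
upper = np.percentile(values, 100 * (1 - alpha / 2))    # 97.5th percentile -> 97.5
```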
def bootstrap_ci(bootstrap_values, alpha=0.05):
    """
    Computes the confidence interval using the percentile method.

    Args:
        bootstrap_values (list[float]): the bootstrapped values
        alpha (float): the significance level, defaults to 0.05

    Returns:
        list[float]: the confidence interval [lower, upper]
    """
    # TODO your code here
    pass

#### test bootstrap_ci ####
if __name__ == "__main__":
    bootstrap_values = np.arange(0, 101)
    cis = bootstrap_ci(bootstrap_values)
    assert np.isclose(cis[0], 2.5), "Lower bound should be 2.5"
    assert np.isclose(cis[1], 97.5), "Upper bound should be 97.5"
    # feel free to add more tests
3.2 Pandas column processing [0.5 pts]#
In order to facilitate the use of our stratified difference-in-means estimator, we’ll need to be able to binarize a continuous covariate into two strata.
Complete the following `binarize_covariate` function, which creates a new binary covariate that is 1 if the covariate is greater than the cutpoint, and 0 otherwise.
def binarize_covariate(df, covariate, cutpoint):
    """
    Creates a new binary covariate that is 1 if the covariate is greater
    than the cutpoint, and 0 otherwise.
    Updates the dataframe with a new column named `{covariate}_bin`

    Args:
        df (pd.DataFrame): the dataframe to update
        covariate (str): the name of the covariate to binarize
        cutpoint (float): the cutpoint to use for binarization

    Returns:
        pd.DataFrame: the updated dataframe with the new binary covariate
    """
    # TODO your code here
    return df

#### test binarize_covariate ####
if __name__ == "__main__":
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
    expected = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'A_bin': [0, 0, 1, 1, 1]})
    assert binarize_covariate(df, 'A', 2.5).equals(expected), "Binarization is incorrect"
    # feel free to add more tests
How to submit

Like the worksheets, follow the instructions on the course website to submit your work. For Part 1, you will submit `proj1_functions.py` and `proj1_functions.ipynb`.