Activity 7 Solution: Positivity and regression in Yeager et al. 2019#
2025-03-04
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
# load and examine the data
learning_df = pd.read_csv("~/COMSC-341CD/data/learning_mindset.csv")
learning_df.head()
| | schoolid | intervention | achievement_score | success_expect | ethnicity | gender | frst_in_family | school_urbanicity | school_mindset | school_achievement | school_ethnic_minority | school_poverty | school_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76 | 1 | 0.277359 | 6 | 4 | 2 | 1 | 4 | 0.334544 | 0.648586 | -1.310927 | 0.224077 | -0.426757 |
| 1 | 76 | 1 | -0.449646 | 4 | 12 | 2 | 1 | 4 | 0.334544 | 0.648586 | -1.310927 | 0.224077 | -0.426757 |
| 2 | 76 | 1 | 0.769703 | 6 | 4 | 2 | 0 | 4 | 0.334544 | 0.648586 | -1.310927 | 0.224077 | -0.426757 |
| 3 | 76 | 1 | -0.121763 | 6 | 4 | 2 | 0 | 4 | 0.334544 | 0.648586 | -1.310927 | 0.224077 | -0.426757 |
| 4 | 76 | 1 | 1.526147 | 6 | 4 | 1 | 0 | 4 | 0.334544 | 0.648586 | -1.310927 | 0.224077 | -0.426757 |
learning_df['achievement_score'].std()
1.0
This selected portion of the National Study of Learning Mindsets dataset is not truly randomized, so we’ll need to adjust for confounding.
The columns we will look at are:
- `intervention`: \(T\), whether the student received the intervention (1) or not (0)
- `success_expect`: the student's prior mindset about their ability to succeed in school (higher values indicate a stronger belief in their ability to succeed)
- `frst_in_family`: whether the student would be the first in their family to attend college (1) or not (0)
- `gender`: the student's self-reported gender
- `school_urbanicity`: categorical variable corresponding to the urbanicity of the school the student attends, e.g. urban, suburban, rural
- `achievement_score`: \(Y\), the student's future grade achievement, standardized such that 0 is the mean and it has a standard deviation of 1
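(Optional) As a quick sanity check of the \(Y\) description above, we can look at the mean and standard deviation of `achievement_score` directly; this is a small extra check, not part of the original activity:

# sanity check: achievement_score should be roughly standardized (mean near 0, std near 1)
learning_df['achievement_score'].agg(['mean', 'std'])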
Part 1: Imbalance in covariates#
In this version of the dataset we are analyzing, there appear to be differences in the distribution of covariates between the treatment and control groups: participants have different probabilities of receiving the intervention depending on their covariate values.
1.1#
Perform two separate `groupby` operations to compute the mean of `intervention`, grouping by:

- `success_expect`
- `frst_in_family`

What do you observe? Are certain groups of students more or less likely to receive the growth mindset video? Keep in mind that `intervention=1` corresponds to the growth mindset video intervention.
Your response: Students who report higher `success_expect` are more likely to receive the intervention (treatment rates rise from roughly 0.27 at the lowest levels to 0.36 at the highest), and students who would be first in their family to attend college are slightly less likely to receive it (0.31 vs. 0.35). Treatment assignment is therefore not independent of these covariates.
# TODO perform two separate groupby operations
print(learning_df.groupby(['success_expect'])['intervention'].mean())
print(learning_df.groupby(['frst_in_family'])['intervention'].mean())
success_expect
1 0.271739
2 0.265957
3 0.294118
4 0.271617
5 0.311070
6 0.354287
7 0.362319
Name: intervention, dtype: float64
frst_in_family
0 0.353325
1 0.309487
Name: intervention, dtype: float64
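Since `seaborn` is already imported, one optional way to see this imbalance at a glance (a small sketch, not required by the activity) is to plot the average treatment rate within each `success_expect` level:

# optional: bar plot of the treatment rate by success_expect
# (sns.barplot averages `intervention` within each level by default)
sns.barplot(data=learning_df, x='success_expect', y='intervention')
plt.ylabel('proportion receiving intervention')
plt.show()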
Part 2: Examining positivity#
covariates = ['success_expect', 'frst_in_family']
2.1#
Now that we've seen some potential confounding in `success_expect` and `frst_in_family`, let's try to control for them. If we take the same stratification strategy we have used before, we'll need to bin on the confounders and compute treatment effects within each bin.

However, we also need to be careful about positivity violations. First, let's compute the total number of bins we would need to create if we want to control for these two covariates.

We can do this by using `pd.Series.nunique` to get the number of unique values for each covariate and then multiplying them together. This is like taking a cross product over all possible values of each variable.
learning_df['frst_in_family'].nunique()
2
# TODO calculate the total number of bins
total_bins = learning_df['frst_in_family'].nunique() * learning_df['success_expect'].nunique()
print(f"Total number of bins: {total_bins}")
Total number of bins: 14
2.2#
Next, let's check whether positivity holds. We can do this by grouping over the covariates plus the intervention, and then counting how many unique groups are actually present in the data.

To generate the per-bin counts, we perform a `groupby(all_cols, as_index=False)` over the intervention and all combinations of the other columns, and then check the `ngroups` attribute of the resulting groupby object. How many groups are there?
# Group by the intervention column and the two covariates
all_cols = ['intervention', 'success_expect', 'frst_in_family']
group_count = learning_df.groupby(all_cols, as_index=False).ngroups
print(f"Number actual groups among the bins for {all_cols}: {group_count}")
Number actual groups among the bins for ['intervention', 'success_expect', 'frst_in_family']: 28
Since we need each bin to have both control and treatment units in order to have a valid comparison, the number of groups should equal 2 times the total number of bins from 2.1 for there to be no positivity violations.

Does the number of groups you found here match that?
Your response: Yes. We found 28 groups, which equals 2 × 14, so every bin contains both treated and control students and there are no positivity violations when stratifying on these two covariates.
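To make this check explicit in code, here is a small sketch that reuses `total_bins` from 2.1 and `group_count` from the cell above:

# positivity holds for these two covariates only if every bin
# contains both a treated group and a control group
print(f"Expected groups if positivity holds: {2 * total_bins}")
print(f"Actual groups present: {group_count}")
print(f"Positivity violations: {group_count != 2 * total_bins}")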
2.3#
Ideally we'd like to control for as many confounders as possible to make conditional exchangeability more plausible. Let's now add `gender` and `school_urbanicity` to our list of covariates, making a total of 4 confounders.

Repeat the analysis above with the new set of covariates. Do we see positivity violations with the new set of covariates?
Your response: Yes. For positivity to hold we would need 2 × 140 = 280 groups, but only 264 are present in the data, so some bins are missing treated units, control units, or both.
# TODO your code here
all_cols = ['intervention', 'success_expect', 'gender', 'frst_in_family', 'school_urbanicity',]
group_count = learning_df.groupby(all_cols, as_index=False).ngroups
print(f"Number actual groups among the bins for {all_cols}: {group_count}")
expected_count = np.prod([learning_df[col].nunique() for col in all_cols[1:]]) * 2
print(f"Expected number of groups for positivity to hold: {expected_count}")
Number of actual groups among the bins for ['intervention', 'success_expect', 'gender', 'frst_in_family', 'school_urbanicity']: 264
Expected number of groups for positivity to hold: 280
Part 3: Regression for adjustment estimation#
In Parts 1 and 2, we observed covariate imbalances and potential positivity violations when trying to use stratification for adjustment. Let’s now use regression as an alternative approach for estimating the causal effect while controlling for confounders.
3.1#
Regression can be used to estimate causal effects when we have conditional exchangeability. If we assume that we have measured all confounders and included them in our regression model, then the coefficient on the treatment variable can be interpreted as the average treatment effect (ATE).
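Concretely, for treatment \(T\) and confounders \(X_1, \dots, X_k\), the model we fit is the standard linear adjustment (a sketch; the ATE interpretation of \(\beta_1\) relies on conditional exchangeability and the linearity assumption):

\[
Y = \beta_0 + \beta_1 T + \beta_2 X_1 + \dots + \beta_{k+1} X_k + \varepsilon
\]

where \(\beta_1\) is the coefficient we interpret as the ATE.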
Let’s first import the statsmodels formula API:
# canonical import for the formula API
import statsmodels.formula.api as smf
First, let’s fit a naive regression model that doesn’t adjust for any confounders. We’ll use the following formula:
formula = 'outcome ~ 1 + treatment' # equivalent to outcome = beta_0 + beta_1 treatment
This formula tells statsmodels to fit a model where the outcome is regressed on the treatment variable and an intercept (the 1). We then pass that formula string to the `smf.ols` function, along with the data:
model = smf.ols(formula, data=data).fit()
The parameter estimates are stored in the params
attribute of the model object:
model.params
Fit a naive regression model on the `intervention` and `achievement_score` variables. What is the fitted coefficient for `intervention`?
Your response: The fitted coefficient on `intervention` is about 0.47, meaning the naive (unadjusted) difference in achievement between the treated and control groups is roughly 0.47 standard deviations.
learning_df.columns
Index(['schoolid', 'intervention', 'achievement_score', 'success_expect',
'ethnicity', 'gender', 'frst_in_family', 'school_urbanicity',
'school_mindset', 'school_achievement', 'school_ethnic_minority',
'school_poverty', 'school_size'],
dtype='object')
# TODO fit a naive regression model and check the params attribute
formula = 'achievement_score ~ 1 + intervention'
model = smf.ols(formula, data=learning_df).fit()
model.params
Intercept -0.153803
intervention 0.472272
dtype: float64
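With a binary treatment and no other covariates, the OLS coefficient on the treatment is just the difference in group means, so we can verify the naive estimate directly (a quick check, not part of the original prompt):

# the naive OLS coefficient equals the simple difference in mean outcomes
group_means = learning_df.groupby('intervention')['achievement_score'].mean()
print(group_means[1] - group_means[0])  # should be approximately 0.472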
3.2#
Next, let’s fit a model that adjusts for the confounders we identified in Part 1.
We can do this by adding the confounders to the formula:
formula = 'outcome ~ 1 + treatment + confounder1 + confounder2'
How does the coefficient on `intervention` change when we adjust for the confounders?
Your response: The coefficient on `intervention` decreases from about 0.47 to about 0.41 once we adjust for `success_expect` and `frst_in_family`, suggesting that the naive estimate was biased upward by confounding.
# TODO fit a model that adjusts for the confounders and check the params attribute
formula = 'achievement_score ~ 1 + intervention + success_expect + frst_in_family'
model = smf.ols(formula, data=learning_df).fit()
model.params
Intercept -2.023493
intervention 0.414097
success_expect 0.373244
frst_in_family -0.123081
dtype: float64
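statsmodels also provides uncertainty estimates for the fitted coefficients; for example, assuming `model` is the adjusted fit from the previous cell, we can inspect 95% confidence intervals (a small optional check):

# 95% confidence intervals for each coefficient (default alpha=0.05)
model.conf_int()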
Note
We’ll have more opportunities to practice with statsmodels and linear regression in Worksheet 4!
Optional extra#
If we actually want to see which bins are missing, we can generate a pivot table of the counts and then identify the bins that are missing units in either the control or the treatment group.
all_cols = ['success_expect', 'gender', 'frst_in_family', 'school_urbanicity', 'intervention']
count_df = learning_df.groupby(all_cols, as_index=False).size()
# Create a pivot table to show counts by intervention and bins
bin_pivot = pd.pivot_table(
count_df,
index=['success_expect', 'gender', 'frst_in_family', 'school_urbanicity'],
columns=['intervention'],
values='size',  # per-bin counts computed by groupby(...).size() above
fill_value=0
)
# Display information about the pivot table
print("Bins with no control units:")
display(bin_pivot[bin_pivot[0] == 0])
print("Bins with no treatment units:")
display(bin_pivot[bin_pivot[1] == 0])
Bins with no control units:
| success_expect | gender | frst_in_family | school_urbanicity | intervention=0 | intervention=1 |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 0.0 | 2.0 |
| 2 | 1 | 0 | 1 | 0.0 | 2.0 |
Bins with no treatment units:
| success_expect | gender | frst_in_family | school_urbanicity | intervention=0 | intervention=1 |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1.0 | 0.0 |
| 1 | 1 | 0 | 2 | 2.0 | 0.0 |
| 1 | 1 | 1 | 0 | 3.0 | 0.0 |
| 1 | 2 | 0 | 0 | 2.0 | 0.0 |
| 1 | 2 | 0 | 1 | 1.0 | 0.0 |
| 1 | 2 | 0 | 2 | 1.0 | 0.0 |
| 2 | 1 | 0 | 0 | 2.0 | 0.0 |
| 2 | 1 | 0 | 3 | 2.0 | 0.0 |
| 2 | 1 | 1 | 1 | 10.0 | 0.0 |
| 2 | 2 | 0 | 3 | 4.0 | 0.0 |
| 3 | 1 | 0 | 0 | 5.0 | 0.0 |
| 3 | 1 | 0 | 2 | 11.0 | 0.0 |
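Note that `bin_pivot` only contains covariate combinations that appear somewhere in the data; as a final optional check, we can count how many of the 140 possible bins contain no students at all:

# bins with no students in either the treatment or the control group
n_possible_bins = np.prod([learning_df[col].nunique() for col in all_cols[:-1]])
print(f"Bins with no students at all: {n_possible_bins - len(bin_pivot)}")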
References#
Yeager, D. S. et al. (2019). A national experiment reveals where a growth mindset improves achievement. Nature.
Athey, S., & Wager, S. (2019). Estimating treatment effects with causal forests: An application. Observational studies.
Facure, M. (2023). Causal Inference for the Brave and the True.