Activity 8 Solution: Extrapolation, overlap, and propensity scores#

2025-03-06


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

from ipywidgets import interact_manual
from ipywidgets import widgets


rng = np.random.default_rng(42)

Part 1: Distribution of covariates under true randomization#

On Tuesday we saw that the success_expect covariate was associated with the intervention. What should the distribution of success_expect look like under true randomization (coin flips)? And does it matter if the coin flips are “biased” (not 50% heads, 50% tails)?

First, let’s simulate a distribution of success_expect under true randomization. Begin by generating some data that represents success_expect, which takes on integer values between 1 and 7. Use the rng.choice function to generate 10,000 samples from a uniform distribution over the integers 1 through 7.

n_samples = 10000
# sim_success_expect = TODO call rng.choice with the a and size parameters

### BEGIN SOLUTION
sim_success_expect = rng.choice(a=np.arange(1, 8), size=n_samples)
### END SOLUTION

Next, generate random treatment assignments for each sample using the rng.choice function. For now, assume the probability of treatment is 0.5, that is:

\[ P(T=1) = 0.5 \]
p_treat = 0.5
# p needs to be a list of probabilities for each outcome: [0, 1]
p = [1-p_treat, p_treat]
# sim_treat = TODO call rng.choice with the a, size, and p parameters

### BEGIN SOLUTION
sim_treat = rng.choice(a=[0,1], p=p, size=n_samples)
### END SOLUTION

Let’s now plot the distribution of treatment for each sim_success_expect level. We’ll do this by creating a dataframe with the two columns sim_success_expect and sim_treat, and then using sns.barplot to plot the distribution of sim_treat for each sim_success_expect level.

Call sns.barplot with the x and y parameters set to sim_success_expect and sim_treat, respectively.

sim_df = pd.DataFrame({
    'sim_success_expect': sim_success_expect, 
    'sim_treat': sim_treat
})

# TODO sns.barplot

### BEGIN SOLUTION
sns.barplot(x='sim_success_expect', y='sim_treat', data=sim_df)
plt.axhline(y=p_treat, color='black', linestyle='--', label='True probability of treatment')
plt.legend()
### END SOLUTION
[Figure: barplot of the proportion treated (sim_treat) at each sim_success_expect level, with a dashed horizontal line at the true probability of treatment]

The black vertical lines on each bar represent a bootstrapped confidence interval for the proportion of treated units in each sim_success_expect level. On your plot above, add a horizontal line using plt.axhline at y=p_treat to represent the true probability of treatment. The observed probability of treatment shouldn’t deviate significantly from the true probability of treatment for any of the sim_success_expect values.
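
As a quick numeric companion to the plot, we can check the proportion of treated units at each level directly. A minimal check using the sim_df built above:

# proportion treated at each simulated covariate level;
# under true randomization, every value should be close to p_treat
print(sim_df.groupby('sim_success_expect')['sim_treat'].mean())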

Finally, let’s explore different probabilities of treatment. Copy your code from the three cells above that created your barplot into the plot_treatment_distribution function below, and add an interact_manual decorator that lets you change p_treat.

Does the probability of treatment change the distribution of treated units among the sim_success_expect values? Briefly comment on why/why not.

Your response: pollev.com/tliu

# TODO an interact_manual
def plot_treatment_distribution(p_treat):
    pass

rng = np.random.default_rng(42)


# make this a slider; either formulation works
@interact_manual(p_treat=(0, 1, 0.1))
#@interact_manual(p_treat=widgets.FloatSlider(value=0.5, min=0, max=1, step=0.1))
def plot_treatment_distribution(p_treat):
    ### BEGIN SOLUTION
    sim_success_expect = rng.choice(a=np.arange(1, 8), size=n_samples)

    # p needs to be a list of probabilities for each outcome: [0, 1]
    p = [1-p_treat, p_treat]
    sim_treat = rng.choice(a=[0,1], p=p, size=n_samples)

    sim_df = pd.DataFrame({
        'sim_success_expect': sim_success_expect,
        'sim_treat': sim_treat
    })

    sns.barplot(x='sim_success_expect', y='sim_treat', data=sim_df)
    plt.axhline(y=p_treat, ls='--', color='red')
    ### END SOLUTION

Part 2: Fitting a Propensity Score#

Let’s now estimate a propensity score for the learning mindsets data. As a reminder from Tuesday’s lecture, the columns we’ll look at are:

  • intervention: whether the student received the intervention (1) or not (0)

  • success_expect: student prior mindset about their ability to succeed in school (higher values indicate a stronger belief in their ability to succeed)

  • frst_in_family: whether the student would be the first in their family to attend college (1) or not (0)

  • gender: student’s self-reported gender

  • school_urbanicity: categorical variable corresponding to the urbanicity of the school the student attends, e.g. urban, suburban, rural

  • achievement_score: the student’s future grade achievement, standardized to have mean 0 and standard deviation 1

The way we’ll do this is by fitting a logistic regression model, which allows us to estimate the following:

\[ P(T=1 \mid X) = E[T \mid X] = e(X) \]

Like with linear regression, this can extend to multiple covariates, such as:

\[ P(T=1 \mid X_1, X_2, X_3, X_4) = E[T \mid X_1, X_2, X_3, X_4] \]
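
Concretely, logistic regression models this conditional probability with the logistic (sigmoid) function, whose coefficients \(\beta_0, \dots, \beta_4\) are estimated from the data:

\[ P(T=1 \mid X_1, X_2, X_3, X_4) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4)}} \]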

In statsmodels, we just need to specify the formula string 'T ~ 1 + X1 + X2 + X3 + X4' and pass the dataframe to smf.logit instead of smf.ols to fit a logistic regression model:

model = smf.logit('T ~ 1 + X1 + X2 + X3 + X4', data=data).fit()

Fit a propensity score model predicting intervention from our four covariates from Tuesday’s lecture (success_expect, gender, frst_in_family, school_urbanicity).

One additional consideration is that gender and school_urbanicity are categorical variables. We can indicate to smf.logit that these variables are categorical by wrapping the variables in C() in the formula string. For example, if we wanted to fit a model with X1 and X2 as categorical variables, we would write:

model = smf.logit('T ~ 1 + C(X1) + C(X2) + X3 + X4', data=data).fit()
# load the learning_mindset data again
learning_df = pd.read_csv("~/COMSC-341CD/data/learning_mindset.csv")
# TODO fit a propensity score model
formula = ''
#propensity_model = smf.logit(formula=formula, data=learning_df).fit()

### BEGIN SOLUTION
formula = 'intervention ~ 1 + success_expect + frst_in_family + C(school_urbanicity) + C(gender)'
propensity_model = smf.logit(formula=formula, data=learning_df).fit()
### END SOLUTION
Optimization terminated successfully.
         Current function value: 0.628017
         Iterations 5

We can then add the predicted propensity scores as a column to analyze along with the rest of the data. Run the cell below to generate predicted propensity scores and add them to the learning_df:

pred_propensity_scores = propensity_model.predict(learning_df)
learning_df['propensity_score'] = pred_propensity_scores

A quick check for whether we’ve fit the propensity score correctly is to verify that the average propensity score is equal to the probability of treatment in the dataset:

p_treat = learning_df['intervention'].mean()

assert np.isclose(learning_df['propensity_score'].mean(), p_treat, atol=0.01), "The average propensity score is not equal to the probability of treatment"
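
This check works because the average of the propensity score is, by the law of iterated expectations, the overall probability of treatment:

\[ E[e(X)] = E[E[T \mid X]] = E[T] = P(T=1) \]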

Propensity scores give us a lot of flexibility in visualizing confounding issues. For example, we can generate the same barplot we made in Part 1 with success_expect on the x-axis, but with propensity_score on the y-axis for our learning_df dataset.

What do you observe about the relationship between success_expect and propensity_score?

Your response: pollev.com/tliu

# TODO a barplot of success_expect vs propensity_score

### BEGIN SOLUTION
sns.barplot(x='success_expect', y='propensity_score', data=learning_df)
# note: compared to plotting the raw intervention rates, the propensity score
# lets us see the confounding much more precisely
### END SOLUTION
[Figure: barplot of mean propensity_score at each success_expect level]

Part 3: Propensity scores for exploring covariate distribution#

In Project 1, we plotted the distribution of covariates of treated and control units to visually verify that the covariates were balanced between the two groups.

The propensity score is a convenient way to check whether all the covariates we want to control for are balanced between the two groups.

Run the cell below to fit a propensity score model for our simulated data from Part 1. Then plot the distribution of propensity scores for the treated and control units by creating a sns.kdeplot with the x-axis as the propensity_score and the hue as the sim_treat variable – optionally setting cumulative=True to plot the cumulative distribution. The distributions should approximately match since we simulated a randomized experiment:

sim_propensity_model = smf.logit('sim_treat ~ 1 + sim_success_expect', data=sim_df).fit()
sim_propensity_scores = sim_propensity_model.predict(sim_df)
sim_df['propensity_score'] = sim_propensity_scores

# TODO kdeplot

### BEGIN SOLUTION
sns.kdeplot(x='propensity_score', data=sim_df, hue='sim_treat', cumulative=True)
### END SOLUTION
Optimization terminated successfully.
         Current function value: 0.693081
         Iterations 3
[Figure: cumulative KDEs of propensity_score for sim_treat = 0 and sim_treat = 1, which approximately overlap]

Now, create the same plot for our learning_df dataset, setting the hue to intervention and x to propensity_score.

Differences in the distributions suggest that the covariates are not balanced between the two groups – in other words, there is confounding among the covariates we’ve included in the propensity score model.

sns.kdeplot(x='propensity_score', data=learning_df, hue='intervention')
[Figure: KDEs of propensity_score for intervention = 0 and intervention = 1]

This type of density plot of the propensity score also allows us to check positivity – if there are regions of the propensity score where there are only treated or only control units, then there is a positivity violation.

Does there appear to be any obvious regions where there are only treated or only control units? What about regions where there are few treated or control units?

Your response: pollev.com/tliu
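
If you want a numeric complement to the plot, one minimal check (assuming the learning_df and propensity_score column from above) is to compare the range of propensity scores within each treatment group; fully overlapping ranges rule out only the grossest positivity violations:

# min and max propensity score within each treatment group
print(learning_df.groupby('intervention')['propensity_score'].agg(['min', 'max']))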

Part 4: Propensity scores as a balancer#

Now that we have the propensity score distilling high-dimensional covariates into a single number, binning the data by the propensity score becomes a viable strategy for balancing the covariates.

Run the cell below to bin the data by the propensity score using the pd.cut function, producing a new column propensity_score_bin in the learning_df dataframe.

n_bins = 5
learning_df['propensity_score_bin'] = pd.cut(learning_df['propensity_score'], bins=n_bins)

# view the bins created by pd.cut
print(learning_df['propensity_score_bin'].unique())
[(0.287, 0.329], (0.245, 0.287], (0.329, 0.371], (0.371, 0.412], (0.203, 0.245]]
Categories (5, interval[float64, right]): [(0.203, 0.245] < (0.245, 0.287] < (0.287, 0.329] < (0.329, 0.371] < (0.371, 0.412]]
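
Note that pd.cut creates equal-width bins rather than equal-count bins, so it can be worth checking how many units fall into each bin. A quick optional check:

# number of units in each propensity score bin
print(learning_df['propensity_score_bin'].value_counts().sort_index())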

Then, perform a groupby with by=['propensity_score_bin', 'intervention'], and calculate the mean of the given columns on the grouped data.

Are there any of the columns that are significantly different between the \(T=1\) and \(T=0\) groups for each propensity score bin? If so, how do you interpret this?

Your response: pollev.com/tliu

# TODO group by propensity score bin and intervention, and calculate the mean of the columns on the grouped data
columns = ['success_expect', 'frst_in_family', 'school_urbanicity', 'gender', 'achievement_score']

### BEGIN SOLUTION
# observed=False keeps all bin/intervention combinations and silences the pandas FutureWarning
balance_df = learning_df.groupby(['propensity_score_bin', 'intervention'], observed=False)[columns].mean()
### END SOLUTION

balance_df
                                   success_expect  frst_in_family  school_urbanicity    gender  achievement_score
propensity_score_bin intervention
(0.203, 0.245]       0                   1.966667        0.988889           2.294444  1.766667          -0.855587
                     1                   1.950000        0.983333           2.516667  1.816667          -0.671004
(0.245, 0.287]       0                   4.013749        0.957837           2.274977  1.726856          -0.766884
                     1                   3.984694        0.954082           2.283163  1.709184          -0.434164
(0.287, 0.329]       0                   5.213038        0.876008           2.446909  1.530242          -0.258401
                     1                   5.270698        0.883369           2.436285  1.521238           0.164735
(0.329, 0.371]       0                   5.704668        0.278133           2.444226  1.445209           0.077631
                     1                   5.706571        0.274527           2.454545  1.452745           0.489200
(0.371, 0.412]       0                   6.533793        0.180690           2.706207  1.110345           0.722760
                     1                   6.500000        0.148148           2.696759  1.097222           1.194055
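
To read the table more easily, we can compute the within-bin differences in group means directly. This is an optional sketch rather than part of the activity; the unstack/xs reshaping and the diff_df name are ours:

# move intervention into the columns, then subtract control means from treated means
diff_df = balance_df.unstack('intervention')
diff_df = diff_df.xs(1, level='intervention', axis=1) - diff_df.xs(0, level='intervention', axis=1)
print(diff_df)

Within each bin, the covariate differences should be close to zero, while the achievement_score differences show the within-bin outcome gap between treated and control units.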