Activity 10: Propensity Scores for Overlap and Balancing Live#

2025-10-08


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

from ipywidgets import interact_manual
from ipywidgets import widgets


rng = np.random.default_rng(42)

Part 1: Distribution of covariates under true randomization#

For the growth mindset intervention, we saw that the success_expect covariate was associated with the intervention. What should the distribution of success_expect look like under true randomization (coin flips)? And does it matter if the coin flips are “biased” (not 50% heads, 50% tails)?

First, let’s simulate a distribution of success_expect under true randomization. Begin by generating some data that represents success_expect, which takes on integer values between 1 and 7. Use the rng.choice function to generate 10,000 samples from a uniform distribution over the integers 1 through 7.

help(rng.choice)
Help on built-in function choice:

choice(...) method of numpy.random._generator.Generator instance
    choice(a, size=None, replace=True, p=None, axis=0, shuffle=True)
    
    Generates a random sample from a given array
    
    Parameters
    ----------
    a : {array_like, int}
        If an ndarray, a random sample is generated from its elements.
        If an int, the random sample is generated from np.arange(a).
    size : {int, tuple[int]}, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn from the 1-d `a`. If `a` has more
        than one dimension, the `size` shape will be inserted into the
        `axis` dimension, so the output ``ndim`` will be ``a.ndim - 1 +
        len(size)``. Default is None, in which case a single value is
        returned.
    replace : bool, optional
        Whether the sample is with or without replacement. Default is True,
        meaning that a value of ``a`` can be selected multiple times.
    p : 1-D array_like, optional
        The probabilities associated with each entry in a.
        If not given, the sample assumes a uniform distribution over all
        entries in ``a``.
    axis : int, optional
        The axis along which the selection is performed. The default, 0,
        selects by row.
    shuffle : bool, optional
        Whether the sample is shuffled when sampling without replacement.
        Default is True, False provides a speedup.
    
    Returns
    -------
    samples : single item or ndarray
        The generated random samples
    
    Raises
    ------
    ValueError
        If a is an int and less than zero, if p is not 1-dimensional, if
        a is array-like with a size 0, if p is not a vector of
        probabilities, if a and p have different lengths, or if
        replace=False and the sample size is greater than the population
        size.
    
    See Also
    --------
    integers, shuffle, permutation
    
    Notes
    -----
    Setting user-specified probabilities through ``p`` uses a more general but less
    efficient sampler than the default. The general sampler produces a different sample
    than the optimized sampler even if each element of ``p`` is 1 / len(a).
    
    Examples
    --------
    Generate a uniform random sample from np.arange(5) of size 3:
    
    >>> rng = np.random.default_rng()
    >>> rng.choice(5, 3)
    array([0, 3, 4]) # random
    >>> #This is equivalent to rng.integers(0,5,3)
    
    Generate a non-uniform random sample from np.arange(5) of size 3:
    
    >>> rng.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
    array([3, 3, 0]) # random
    
    Generate a uniform random sample from np.arange(5) of size 3 without
    replacement:
    
    >>> rng.choice(5, 3, replace=False)
    array([3,1,0]) # random
    >>> #This is equivalent to rng.permutation(np.arange(5))[:3]
    
    Generate a uniform random sample from a 2-D array along the first
    axis (the default), without replacement:
    
    >>> rng.choice([[0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, replace=False)
    array([[3, 4, 5], # random
           [0, 1, 2]])
    
    Generate a non-uniform random sample from np.arange(5) of size
    3 without replacement:
    
    >>> rng.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
    array([2, 3, 0]) # random
    
    Any of the above can be repeated with an arbitrary array-like
    instead of just integers. For instance:
    
    >>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
    >>> rng.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
    array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
          dtype='<U11')
n_samples = 10000
sim_success_expect = rng.choice(a=np.arange(1, 8), size=n_samples)

Next, generate random treatment assignments for each sample using the rng.choice function. For now, assume the probability of treatment is 0.5, that is:

\[ P(T=1) = 0.5 \]
p_treat = 0.5
# p needs to be a list of probabilities for each outcome: [0, 1]
p = [1-p_treat, p_treat]
sim_treat = rng.choice(a=[0, 1], size=n_samples, p=p)
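
As a quick sanity check (a minimal sketch, not part of the activity scaffold), the overall fraction of treated units should land close to p_treat:

# the mean of a 0/1 treatment vector is the fraction treated;
# with 10,000 draws it should fall within a couple percentage points of p_treat
assert np.isclose(sim_treat.mean(), p_treat, atol=0.02)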

Let’s now plot the distribution of treatment for each sim_success_expect level. We’ll do this by creating a dataframe with the two columns sim_success_expect and sim_treat, and then using sns.barplot to plot the distribution of sim_treat for each sim_success_expect level.

Call sns.barplot with the x and y parameters set to sim_success_expect and sim_treat, respectively.

sim_df = pd.DataFrame({
    'sim_success_expect': sim_success_expect, 
    'sim_treat': sim_treat
})

sim_df.head()
# barplot of the proportion treated at each sim_success_expect level,
# with bootstrapped confidence intervals as error bars
sns.barplot(x='sim_success_expect', y='sim_treat', data=sim_df)
# horizontal line corresponding to p_treat
plt.axhline(y=p_treat, label="true probability of treatment", color="red")
plt.legend()
[Figure: proportion treated at each sim_success_expect level, with a red horizontal line at p_treat]

The black vertical lines on each bar represent bootstrapped confidence intervals for the proportion of treated units at each sim_success_expect level. On your plot above, add a horizontal line using plt.axhline at y=p_treat to represent the true probability of treatment. The observed probability of treatment shouldn’t deviate significantly from the true probability of treatment for any of the sim_success_expect values.
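
If you want to check this numerically rather than visually, one small sketch (not part of the original scaffold) is to compute the observed proportion treated at each level directly:

# observed proportion treated within each sim_success_expect level;
# under true randomization, each should be close to p_treat
sim_df.groupby('sim_success_expect')['sim_treat'].mean()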

Finally, let’s explore different probabilities of treatment. Copy your code from the three cells above that created your barplot into the plot_treatment_distribution function below, and add an interact_manual decorator that lets you change p_treat.

Does the probability of treatment change the distribution of treated units among the sim_success_expect values?

# interact_manual lets us rerun the simulation with a chosen p_treat;
# a (min, max, step) tuple gives a slider over treatment probabilities
@interact_manual(p_treat=(0.0, 1.0, 0.1))
def plot_treatment_distribution(p_treat):
    n_samples = 10000
    sim_success_expect = rng.choice(a=np.arange(1, 8), size=n_samples)
    p = [1 - p_treat, p_treat]
    sim_treat = rng.choice(a=[0, 1], size=n_samples, p=p)

    sim_df = pd.DataFrame({
        'sim_success_expect': sim_success_expect,
        'sim_treat': sim_treat
    })

    # barplot of the proportion treated at each sim_success_expect level
    sns.barplot(x='sim_success_expect', y='sim_treat', data=sim_df)
    # horizontal line corresponding to p_treat
    plt.axhline(y=p_treat, label="true probability of treatment", color="red")
    plt.legend()

Part 2: Fitting a Propensity Score#

Let’s now estimate a propensity score for the learning mindsets data. As a reminder from Tuesday’s lecture, the columns we’ll look at are:

  • intervention: whether the student received the intervention (1) or not (0)

  • success_expect: student prior mindset about their ability to succeed in school (higher values indicate a stronger belief in their ability to succeed)

  • frst_in_family: whether the student would be the first in their family to attend college (1) or not (0)

  • gender: student’s self-reported gender

  • school_urbanicity: categorical variable corresponding to the urbanicity of the school the student attends, e.g. urban, suburban, rural

  • achievement_score: the student’s future grade achievement, standardized such that 0 is the mean and it has a standard deviation of 1

The way we’ll do this is by fitting a logistic regression model, which allows us to estimate the following:

\[ P(T=1 \mid X) = E[T \mid X] = e(X) \]

Like with linear regression, this can extend to multiple covariates, such as:

\[ P(T=1 \mid X_1, X_2, X_3, X_4) = E[T \mid X_1, X_2, X_3, X_4] \]

In statsmodels, we just need to specify the formula string 'T ~ 1 + X1 + X2 + X3 + X4' and pass the dataframe to smf.logit instead of smf.ols to fit a logistic regression model:

model = smf.logit('T ~ 1 + X1 + X2 + X3 + X4', data=data).fit()

Fit a propensity score model predicting intervention using the four covariates from Tuesday’s lecture (success_expect, gender, frst_in_family, school_urbanicity).

One additional consideration is that gender and school_urbanicity are categorical variables. We can indicate to smf.logit that these variables are categorical by wrapping the variables in C() in the formula string. For example, if we wanted to fit a model with X1 and X2 as categorical variables, we would write:

model = smf.logit('T ~ 1 + C(X1) + C(X2) + X3 + X4', data=data).fit()
# load the learning_mindset data again
learning_df = pd.read_csv("~/COMSC-341CD/data/learning_mindset.csv")

# fit a propensity score model
formula = 'intervention ~ 1 + success_expect + frst_in_family + C(gender) + C(school_urbanicity)'
propensity_model = smf.logit(formula=formula, data=learning_df).fit()
Optimization terminated successfully.
         Current function value: 0.628017
         Iterations 5

We can then add the predicted propensity scores as a column to analyze along with the rest of the data. Run the cell below to generate predicted propensity scores and add them to the learning_df:

pred_propensity_scores = propensity_model.predict(learning_df)
learning_df['propensity_score'] = pred_propensity_scores

A quick check for whether we’ve fit the propensity score correctly is to verify that the average propensity score equals the probability of treatment in the dataset (for a logistic regression fit with an intercept, the mean of the fitted probabilities always matches the observed treatment rate):

p_treat = learning_df['intervention'].mean()

assert np.isclose(learning_df['propensity_score'].mean(), p_treat, atol=0.01), "The average propensity score is not equal to the probability of treatment"

Propensity scores give us a lot of flexibility in visualizing confounding issues. For example, we can generate the same barplot we made in Part 1 with success_expect on the x-axis, but with propensity_score on the y-axis for our learning_df dataset.

What do you observe about the relationship between success_expect and propensity_score?

Your response: pollev.com/tliu

# barplot of x=success_expect vs y=propensity_score
sns.barplot(x='success_expect', y='propensity_score', data=learning_df)
[Figure: mean propensity_score at each success_expect level]
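
To read the same relationship off numerically, a quick sketch (not part of the activity scaffold) is to average the fitted propensity scores within each success_expect level:

# mean fitted propensity score per success_expect level;
# an increasing trend means success_expect predicts treatment
learning_df.groupby('success_expect')['propensity_score'].mean()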

Part 3: Propensity scores for exploring covariate distribution#

In Project 1, we plotted the distribution of covariates of treated and control units to visually verify that the covariates were balanced between the two groups.

The propensity score is a convenient way to check whether all the covariates we want to control for are balanced between the two groups.

Run the cell below to fit a propensity score model for our simulated data from Part 1. Then plot the distribution of propensity scores for the treated and control units by creating a sns.kdeplot with the x-axis as the propensity_score and the hue as the sim_treat variable – optionally setting cumulative=True to plot the cumulative distribution. The distributions should approximately match since we simulated a randomized experiment:

sim_propensity_model = smf.logit('sim_treat ~ 1 + sim_success_expect', data=sim_df).fit()
sim_propensity_scores = sim_propensity_model.predict(sim_df)
sim_df['propensity_score'] = sim_propensity_scores

# density of propensity scores by treatment group
sns.kdeplot(x='propensity_score', hue='sim_treat', data=sim_df)
Optimization terminated successfully.
         Current function value: 0.693081
         Iterations 3
[Figure: density of propensity_score by sim_treat; the two curves approximately match]
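
If you try the cumulative=True option mentioned above, a minimal sketch looks like this (common_norm=False is assumed here so that each group’s curve is normalized separately and both reach 1):

# cumulative distributions make small shifts between the groups easier to compare
sns.kdeplot(x='propensity_score', hue='sim_treat', data=sim_df,
            cumulative=True, common_norm=False)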

Now, create the same plot for our learning_df dataset, setting the hue to intervention and x to propensity_score.

Differences in the distributions suggest that the covariates are not balanced between the two groups – in other words, there is confounding among the covariates we’ve included in the propensity score model.

sns.kdeplot(x='propensity_score', data=learning_df, hue='intervention')
[Figure: density of propensity_score by intervention]

This type of density plot of the propensity score also allows us to check positivity – if there are regions of the propensity score where there are only treated or only control units, then there is a positivity violation.
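
A simple numeric complement to the density plot (an illustrative sketch, not part of the scaffold) is to compare the range of propensity scores in each group, since scores outside the overlapping range have no comparison units:

# min/max propensity score per treatment group; regions of the score outside
# the intersection of the two ranges have no overlap and violate positivity
learning_df.groupby('intervention')['propensity_score'].agg(['min', 'max'])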

Does there appear to be any obvious regions where there are only treated or only control units? What about regions where there are few treated or control units?

Your response: pollev.com/tliu

Part 4: Propensity scores as a balancer#

Now that we have the propensity score distilling high-dimensional covariates into a single number, binning the data by the propensity score becomes a viable strategy for balancing the covariates.

Run the cell below to bin the data by the propensity score using the pd.cut function, producing a new column propensity_score_bin in the learning_df dataframe.

n_bins = 5
learning_df['propensity_score_bin'] = pd.cut(learning_df['propensity_score'], bins=n_bins)

# view the bins created by pd.cut
print(learning_df['propensity_score_bin'].unique())
[(0.287, 0.329], (0.245, 0.287], (0.329, 0.371], (0.371, 0.412], (0.203, 0.245]]
Categories (5, interval[float64, right]): [(0.203, 0.245] < (0.245, 0.287] < (0.287, 0.329] < (0.329, 0.371] < (0.371, 0.412]]
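
Before comparing means within bins, it can help to confirm that every bin contains both treated and control units; a quick sketch (not part of the scaffold):

# counts of treated and control units in each propensity score bin;
# a bin with zero units in either group cannot support a comparison
learning_df.groupby(['propensity_score_bin', 'intervention'], observed=True).size()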

Then, perform a groupby with by=['propensity_score_bin', 'intervention'], and calculate the mean of the given columns on the grouped data.

Are there any of the columns that are significantly different between the \(T=1\) and \(T=0\) groups for each propensity score bin? If so, how do you interpret this?

Your response: pollev.com/tliu

# group by propensity score bin and intervention, and calculate the mean of the columns
columns = ['success_expect', 'frst_in_family', 'school_urbanicity', 'gender', 'achievement_score']
learning_df.groupby(['propensity_score_bin', 'intervention'], observed=True)[columns].mean()
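
To make the within-bin comparison easier to scan, one optional follow-up sketch is to unstack the treated and control means and take their difference:

# within each propensity score bin, treated-minus-control difference in means;
# values near zero indicate the covariate is balanced within the bin
grouped_means = learning_df.groupby(['propensity_score_bin', 'intervention'], observed=True)[columns].mean()
grouped_means.xs(1, level='intervention') - grouped_means.xs(0, level='intervention')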