Activity 3: Randomized Experiment Balance in Gerber et al. 2008

2025-02-11


# imports we'll need
import numpy as np
import pandas as pd

One way we can check whether exchangeability is plausible is to check whether the distributions of data characteristics are the same in the treatment and control groups. First, let’s load the data at ~/COMSC-341CD/data/gerber2008_activity3.csv and take a look at the first few rows via DataFrame.head().

# TODO uncomment to load data
#vote_df = pd.read_csv("~/COMSC-341CD/data/gerber2008_activity3.csv")
#vote_df.head()

The columns we have are:

  • sex: the respondent’s sex

  • yob: the respondent’s year of birth

  • treatment: the treatment the respondent received

  • voted: our outcome, whether the respondent voted in the 2006 primary election

  • hh_size: the size of the respondent’s household

  • g2000, g2002, g2004: whether the respondent voted in the 2000, 2002, and 2004 general elections

  • p2000, p2002, p2004: whether the respondent voted in the 2000, 2002, and 2004 primary elections

1. Data cleaning

The full study has 4 treatment arms and a control group, but today we’re only interested in the Civic Duty and Neighbors arms. These correspond to the \(T=0\) and \(T=1\) interventions, respectively, that we discussed earlier:

\[\begin{split} T = \begin{cases} 1 & \text{mail stating that their household turnout would be publicized to their neighbors} \\ 0 & \text{mail reminding them that voting is a civic duty} \end{cases} \end{split}\]

Let’s clean the data to only include these two arms by filtering the treatment column.

Note

You may find the pandas.Series.isin() method useful here, in combination with the logical indexing with .loc that we saw on worksheet 2.

You may encounter a data issue with the treatment column – try to identify what it is and we’ll discuss how to fix it together.
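As a reminder of how isin() and .loc fit together, here is a minimal sketch on made-up toy data (the toy_df frame and its fruit and price columns are hypothetical and are not part of the Gerber et al. dataset). The same pattern should carry over to filtering on the treatment column, but you will need to adapt it yourself.

```python
import pandas as pd

# Made-up toy data for illustration only (not the activity dataset)
toy_df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry", "apple", "durian"],
    "price": [1.0, 0.5, 3.0, 1.2, 8.0],
})

# Boolean mask: True where 'fruit' is one of the listed values
keep = toy_df["fruit"].isin(["apple", "cherry"])

# Logical indexing with .loc keeps only the rows where the mask is True
subset_df = toy_df.loc[keep]
print(subset_df)

# unique() lists the distinct values in a column, which is a quick way
# to see exactly which strings a column contains
print(toy_df["fruit"].unique())
```

Note that isin() matches values exactly, so it is worth inspecting a column's distinct values before filtering on it.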

# TODO your code here selecting the 'Civic Duty' and 'Neighbors' rows

Let’s generate an additional column that corresponds to the respondent’s age in 2006.

# TODO generate the 'age' column

2. Balance check

Let’s now check the balance of the various characteristics between the \(T=1\) ('Neighbors') and \(T=0\) ('Civic Duty') groups.

For the continuous variables of age and hh_size, let’s compute the mean for each group:

Tip

You may find the pandas.DataFrame.groupby() method from worksheet 2 useful here, or you can compute the means by slicing the dataframe using .loc.
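For reference, both approaches mentioned in the tip can be sketched on made-up toy data (the toy_df frame and its group, height, and weight columns are hypothetical stand-ins, not columns from the activity dataset):

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 90.0],
})

# Approach 1: groupby computes the mean of each selected column per group
group_means = toy_df.groupby("group")[["height", "weight"]].mean()
print(group_means)

# Approach 2: slice one group with .loc, then take the mean of one column
mean_height_a = toy_df.loc[toy_df["group"] == "a", "height"].mean()
print(mean_height_a)
```

The groupby version returns one row per group and one column per variable, which makes side-by-side comparison easy; the .loc version is more explicit but needs one line per group and variable.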

# TODO your code for continuous variables

For each of the categorical variables in ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004'], let’s compute the proportion of respondents who have the characteristic in each treatment group:

Tip

You may want to use the value_counts() method with normalize=True along with .groupby() to compute the proportions for each individual characteristic.
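One possible way to combine .groupby() with value_counts(normalize=True) is sketched below on made-up toy data (the toy_df frame and its group and color columns are hypothetical). For the activity, you would repeat something like this for each variable, for example in a loop.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b"],
    "color": ["red", "red", "red", "blue", "red", "blue"],
})

# Within each group, the share of rows taking each value of 'color'.
# The result is a Series indexed by (group, color) pairs.
color_props = toy_df.groupby("group")["color"].value_counts(normalize=True)
print(color_props)

# unstack() pivots the inner index level into columns, giving one row
# per group and one column per category
print(color_props.unstack())
```

With normalize=True the proportions within each group sum to 1, so groups of different sizes can be compared directly.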

# TODO your code for categorical variable comparison
categorical_vars = ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']

Are there any differences between the \(T=1\) and \(T=0\) groups in terms of these characteristics, or do they seem comparable?

Your response: TODO

3. Causal effect estimation

Finally, let’s estimate the ATE for the voted outcome. From what we looked at in class, because this is a randomized experiment, we can estimate the ATE by computing the difference in means between the \(T=1\) and \(T=0\) groups for the voted outcome. For this particular dataset, that means:

\[\begin{split} \begin{align*} \text{ATE} = E[Y(1) - Y(0)] \xrightarrow[]{\text{Identification}} \; &E[Y | T=1] - E[Y | T=0]\\ E[Y | T=1] - E[Y | T=0] \; \xrightarrow[]{\text{Estimation}} \; &\hat{E}[Y | T=1] - \hat{E}[Y | T=0]\\ =&\hat{P}(\text{voted='Yes'} \mid \text{treatment='Neighbors'}) - \hat{P}(\text{voted='Yes'} \mid \text{treatment='Civic Duty'}) \end{align*} \end{split}\]

Tip

Try computing the two proportions separately and then subtracting them.

We can compute the proportion of respondents who voted ‘Yes’ in each treatment group by using the .value_counts() method with normalize=True and indexing on the 'Yes' category.
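The pattern of computing two proportions and subtracting them can be sketched on made-up toy data (the toy_df frame, its arm and clicked columns, and all the values are hypothetical and say nothing about the actual study results):

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "arm": ["x", "x", "x", "x", "y", "y", "y", "y"],
    "clicked": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Proportion of 'Yes' within arm 'x': slice the rows, normalize the
# counts, then index on the 'Yes' category
p_x = toy_df.loc[toy_df["arm"] == "x", "clicked"].value_counts(normalize=True)["Yes"]

# Same computation for arm 'y'
p_y = toy_df.loc[toy_df["arm"] == "y", "clicked"].value_counts(normalize=True)["Yes"]

# Difference in proportions between the two arms
diff = p_x - p_y
print(p_x, p_y, diff)
```

Keeping the two proportions in separate variables makes it easy to sanity-check each one before taking the difference.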

# TODO your code for estimating the ATE
estimated_ATE = 0

Write down an interpretation of this estimate as a causal quantity.

Your response: TODO

References