Activity 3 Solution: Randomized Experiment Balance in Gerber et al. 2008

2025-02-11


# imports we'll need
import numpy as np
import pandas as pd

One way to check whether exchangeability is plausible is to check whether the distributions of respondent characteristics are the same in the treatment and control groups. First, let’s load the data at ~/COMSC-341CD/data/gerber2008_activity3.csv and look at the first few rows via DataFrame.head().

vote_df = pd.read_csv("../COMSC-341CD/data/gerber2008_activity3.csv")
vote_df.head()
      sex   yob   treatment  voted  hh_size  g2000  g2002  g2004  p2000  p2002  p2004
0    male  1941  Civic Duty     No        2    yes    yes    yes     no    yes     No
1  female  1947  Civic Duty     No        2    yes    yes    yes     no    yes     No
2    male  1951   Hawthorne    Yes        3    yes    yes    yes     no    yes     No
3  female  1950   Hawthorne    Yes        3    yes    yes    yes     no    yes     No
4  female  1982   Hawthorne    Yes        3    yes    yes    yes     no    yes     No
vote_df['treatment'].unique()
array(['Civic Duty', 'Hawthorne', 'Control', 'Self', 'Neighbors'],
      dtype=object)

The columns we have are:

  • sex: the respondent’s sex

  • yob: the respondent’s year of birth

  • treatment: the treatment the respondent received

  • voted: our outcome: whether the respondent voted in the 2006 primary election

  • hh_size: the size of the respondent’s household

  • g2000, g2002, g2004: whether the respondent voted in the 2000, 2002, and 2004 general election

  • p2000, p2002, p2004: whether the respondent voted in the 2000, 2002, and 2004 primary election

1. Data cleaning

The full study has 4 treatment arms and a control group, but today we’re only interested in the Civic Duty and Neighbors arms. These correspond to the \(T=0\) and \(T=1\) interventions we discussed earlier:

\[\begin{split} T = \begin{cases} 1 & \text{mail stating that their household turnout would be publicized to their neighbors} \\ 0 & \text{mail reminding them that voting is a civic duty} \end{cases} \end{split}\]

Let’s clean the data to only include these two arms by filtering the treatment column.

Note

You may find the pandas.Series.isin() method useful here, in combination with the logical indexing with .loc we saw on worksheet 2.

You may encounter a data issue with the treatment column – try to identify what it is and we’ll discuss how to fix it together.

# TODO your code here selecting the 'Civic Duty' and 'Neighbors' rows
# selecting rows with 'Civic Duty' or 'Neighbors', and selecting all the columns with :
# .copy() makes the result an independent DataFrame, so adding columns to it later is safe
vote_df = vote_df.loc[(vote_df['treatment'] == 'Civic Duty') | (vote_df['treatment'] == 'Neighbors'), :].copy()
vote_df['treatment'].unique()
array(['Civic Duty', 'Neighbors'], dtype=object)
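
The same filter can also be written with the isin() method mentioned in the note. The following is a minimal sketch on a small made-up DataFrame: toy_df, arms, mask, and two_arm_df are illustrative names and the rows are not from the Gerber et al. data.

```python
import pandas as pd

# Toy data standing in for vote_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty", "Hawthorne", "Control", "Self", "Neighbors", "Neighbors"],
    "yob": [1941, 1951, 1960, 1975, 1982, 1950],
})

# The two arms we want to keep
arms = ["Civic Duty", "Neighbors"]

# isin() builds a single boolean mask that is True for any value in the list
mask = toy_df["treatment"].isin(arms)

# Select the matching rows and all columns; .copy() returns an independent
# DataFrame so new columns can be added to it without a SettingWithCopyWarning
two_arm_df = toy_df.loc[mask, :].copy()

print(two_arm_df["treatment"].unique())
```

With more than two arms to keep, extending the list passed to isin() is usually easier to read than chaining additional `|` comparisons.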

Let’s generate an additional column that corresponds to the respondent’s age in 2006.

# TODO generate the 'age' column
vote_df['age'] = 2006 - vote_df['yob'] 

2. Balance check

Let’s now check the balance of the various characteristics between the \(T=1\) ('Neighbors') and \(T=0\) ('Civic Duty') groups.

For the continuous variables of age and hh_size, let’s compute the mean for each group:

Tip

You may find the pandas.DataFrame.groupby() method from worksheet 2 useful here, or you can compute the means by slicing the dataframe using .loc.

# TODO your code for continuous variables
vote_df.groupby(by='treatment')[['age', 'hh_size']].mean()
                  age   hh_size
treatment
Civic Duty  49.659035  2.189126
Neighbors   49.852936  2.187770
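
To make the comparison explicit, one option is to subtract one row of the group-mean table from the other. This is a minimal sketch on made-up numbers: toy_df, group_means, and mean_diff are illustrative names, not part of the activity.

```python
import pandas as pd

# Toy data standing in for vote_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty", "Civic Duty", "Neighbors", "Neighbors"],
    "age": [40, 60, 45, 57],
    "hh_size": [2, 3, 2, 2],
})

# One row per treatment group, one column per characteristic
group_means = toy_df.groupby("treatment")[["age", "hh_size"]].mean()

# Difference in means (T=1 minus T=0) for every characteristic at once
mean_diff = group_means.loc["Neighbors"] - group_means.loc["Civic Duty"]

print(mean_diff)
```

Differences close to zero, relative to the scale of each variable, are what we would expect to see if randomization balanced the groups.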

For each of the categorical variables in ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004'], let’s compute the proportion of respondents who have the characteristic in each treatment group:

Tip

You may want to use the value_counts() method with normalize=True along with .groupby() to compute the proportions for each individual characteristic.

# TODO your code for categorical variable comparison
categorical_vars = ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']

for col in categorical_vars:
    display(vote_df.groupby(by='treatment')[col].value_counts(normalize=True))
treatment   sex   
Civic Duty  female    0.500183
            male      0.499817
Neighbors   female    0.500065
            male      0.499935
Name: proportion, dtype: float64
treatment   g2000
Civic Duty  yes      0.841724
            no       0.158276
Neighbors   yes      0.841653
            no       0.158347
Name: proportion, dtype: float64
treatment   g2002
Civic Duty  yes      0.81111
            no       0.18889
Neighbors   yes      0.81134
            no       0.18866
Name: proportion, dtype: float64
treatment   g2004
Civic Duty  yes      1.0
Neighbors   yes      1.0
Name: proportion, dtype: float64
treatment   p2000
Civic Duty  no       0.746428
            yes      0.253572
Neighbors   no       0.748802
            yes      0.251198
Name: proportion, dtype: float64
treatment   p2002
Civic Duty  no       0.611152
            yes      0.388848
Neighbors   no       0.613413
            yes      0.386587
Name: proportion, dtype: float64
treatment   p2004
Civic Duty  No       0.600555
            Yes      0.399445
Neighbors   No       0.593335
            Yes      0.406665
Name: proportion, dtype: float64
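
A more compact view of the same proportions is possible with pd.crosstab, which is a swapped-in alternative to the groupby approach above rather than the method used in this activity. This sketch uses a made-up DataFrame; toy_df and prop_table are illustrative names.

```python
import pandas as pd

# Toy data standing in for vote_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty"] * 4 + ["Neighbors"] * 4,
    "p2000": ["yes", "no", "no", "no", "yes", "yes", "no", "no"],
})

# normalize="index" divides each row by its row total, so every
# treatment group's proportions sum to 1
prop_table = pd.crosstab(toy_df["treatment"], toy_df["p2000"], normalize="index")

print(prop_table)
```

Each treatment group becomes one row, so the two groups can be compared by reading down a column.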

Are there any differences between the \(T=1\) and \(T=0\) groups in terms of these characteristics, or do they seem comparable?

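Your response: All characteristics seem comparable between the two groups. Every proportion differs by less than one percentage point (the largest gap is about 0.7 percentage points, for p2004), mean age differs by about 0.2 years, and mean household size is nearly identical.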

3. Causal effect estimation

Finally, let’s estimate the ATE for the voted outcome. From what we looked at in class, because this is a randomized experiment, we can estimate the ATE by computing the difference in means between the \(T=1\) and \(T=0\) groups for the voted outcome. For this particular dataset, that means:

\[\begin{split} \begin{align*} \text{ATE} = E[Y(1) - Y(0)] \xrightarrow[]{\text{Identification}} \; &E[Y | T=1] - E[Y | T=0]\\ E[Y | T=1] - E[Y | T=0] \; \xrightarrow[]{\text{Estimation}} \; &\hat{E}[Y | T=1] - \hat{E}[Y | T=0]\\ =&\hat{P}(\text{voted='Yes'} \mid \text{treatment='Neighbors'}) - \hat{P}(\text{voted='Yes'} \mid \text{treatment='Civic Duty'}) \end{align*} \end{split}\]

Tip

Try computing the two proportions separately and then subtracting them.

We can compute the proportion of respondents who voted ‘Yes’ in each treatment group by using the .value_counts() method with normalize=True and indexing on the 'Yes' category.

# TODO your code for estimating the ATE
display(vote_df.groupby(by='treatment')['voted'].value_counts(normalize=True))
print(f"ATE solution from class: {.377948 - .314538}")
print("----------------")

# Can also be extracted via indexing on the 'Yes' category
EY_neighbors = vote_df.loc[vote_df['treatment'] == 'Neighbors', 'voted'].value_counts(normalize=True)['Yes']
EY_civic = vote_df.loc[vote_df['treatment'] == 'Civic Duty', 'voted'].value_counts(normalize=True)['Yes']
ATE = EY_neighbors - EY_civic
print(f"Another way to compute ATE via .loc indexing: {ATE}")
treatment   voted
Civic Duty  No       0.685462
            Yes      0.314538
Neighbors   No       0.622052
            Yes      0.377948
Name: proportion, dtype: float64
ATE solution from class: 0.06341000000000002
----------------
Another way to compute ATE via .loc indexing: 0.06341056883566026
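
Because the outcome is binary, the proportion who voted equals the mean of a 0/1 indicator, so the same estimate can be read as a literal difference in means. This is a minimal sketch on made-up data: toy_df, voted_binary, turnout, and toy_ate are illustrative names, and the numbers are not from the study.

```python
import pandas as pd

# Toy data standing in for vote_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty"] * 4 + ["Neighbors"] * 4,
    "voted": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

# Code the outcome as 1 for 'Yes' and 0 otherwise
toy_df["voted_binary"] = (toy_df["voted"] == "Yes").astype(int)

# The mean of a 0/1 column is the proportion of ones in each group
turnout = toy_df.groupby("treatment")["voted_binary"].mean()

# Difference in means: T=1 ('Neighbors') minus T=0 ('Civic Duty')
toy_ate = turnout["Neighbors"] - turnout["Civic Duty"]

print(toy_ate)
```

This avoids indexing on the 'Yes' label, which would raise a KeyError if a group happened to contain no 'Yes' responses.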

Write down an interpretation of this estimate as a causal quantity.

Your response: Mail that applies social pressure (stating that household turnout would be publicized to neighbors) increases voter turnout by about 6.3 percentage points on average, compared with mail that reminds respondents that voting is a civic duty.
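
As a check on units, the estimate is a difference between two proportions, so it is measured in percentage points. The relative change is a different quantity; it is worked out here from the proportions printed above only for contrast:

\[\begin{split} \begin{align*} \widehat{\text{ATE}} &= 0.377948 - 0.314538 \approx 0.0634 \quad \text{(about 6.3 percentage points)}\\ \frac{\widehat{\text{ATE}}}{\hat{E}[Y \mid T=0]} &= \frac{0.0634}{0.314538} \approx 0.20 \quad \text{(about a 20\% relative increase)} \end{align*} \end{split}\]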

References