Activity 3 Solution: Randomized Experiment Balance in Gerber et al. 2008
2025-02-11
# imports we'll need
import numpy as np
import pandas as pd
One way to check whether exchangeability is plausible is to check whether the distributions of the observed characteristics are the same in the treatment and control groups. First, let’s load the data at ~/COMSC-341CD/data/gerber2008_activity3.csv and take a look at the first few rows via DataFrame.head().
vote_df = pd.read_csv("../COMSC-341CD/data/gerber2008_activity3.csv")
vote_df.head()
| | sex | yob | treatment | voted | hh_size | g2000 | g2002 | g2004 | p2000 | p2002 | p2004 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | 1941 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No |
| 1 | female | 1947 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No |
| 2 | male | 1951 | Hawthorne | Yes | 3 | yes | yes | yes | no | yes | No |
| 3 | female | 1950 | Hawthorne | Yes | 3 | yes | yes | yes | no | yes | No |
| 4 | female | 1982 | Hawthorne | Yes | 3 | yes | yes | yes | no | yes | No |
vote_df['treatment'].unique()
array(['Civic Duty', 'Hawthorne', 'Control', 'Self', 'Neighbors'],
dtype=object)
The columns we have are:
- sex: the respondent’s sex
- yob: the respondent’s year of birth
- treatment: the treatment the respondent received
- voted: our outcome, whether the respondent voted in the 2006 primary election
- hh_size: the size of the respondent’s household
- g2000, g2002, g2004: whether the respondent voted in the 2000, 2002, and 2004 general election
- p2000, p2002, p2004: whether the respondent voted in the 2000, 2002, and 2004 primary election
1. Data cleaning
The full study has four treatment arms and a control group, but today we’re only interested in the Civic Duty and Neighbors arms. These correspond to the \(T=0\) (Civic Duty) and \(T=1\) (Neighbors) interventions we discussed earlier.
Let’s clean the data to only include these two arms by filtering the treatment column.
Note
You may find the pandas.Series.isin() method useful here, in combination with the logical indexing with .loc that we saw on worksheet 2.
You may encounter a data issue with the treatment column. Try to identify what it is, and we’ll discuss how to fix it together.
# TODO your code here selecting the 'Civic Duty' and 'Neighbors' rows
# selecting rows with 'Civic Duty' or 'Neighbors', and selecting all the columns with :
vote_df = vote_df.loc[(vote_df['treatment'] == 'Civic Duty') | (vote_df['treatment'] == 'Neighbors'), :]
vote_df['treatment'].unique()
array(['Civic Duty', 'Neighbors'], dtype=object)
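The note above mentions pandas.Series.isin() as an alternative to chaining == comparisons. Here is a minimal sketch of that approach on a small made-up DataFrame (toy_df and two_arm_df are illustrative names, not part of the activity). The whitespace cleanup step is only an assumption about the kind of data issue one might run into; it is not necessarily the issue in the actual file.

```python
import pandas as pd

# Made-up rows for illustration only; this is not the study data.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty", " Neighbors", "Control", "Hawthorne ", "Neighbors"],
    "voted": ["No", "Yes", "No", "Yes", "No"],
})

# Assumption: labels may carry stray leading/trailing whitespace,
# which would make exact string comparisons silently miss rows.
toy_df["treatment"] = toy_df["treatment"].str.strip()

# Keep only the two arms of interest with a single isin() mask.
arms = ["Civic Duty", "Neighbors"]
# .copy() gives an independent DataFrame, so adding columns later
# does not trigger a SettingWithCopyWarning in pandas versions that emit it.
two_arm_df = toy_df.loc[toy_df["treatment"].isin(arms), :].copy()

print(two_arm_df["treatment"].unique())
```

With more than two arms to keep, isin() scales better than a long chain of | conditions, since only the list of labels changes.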
Let’s generate an additional column that corresponds to the respondent’s age in 2006.
# TODO generate the 'age' column
vote_df['age'] = 2006 - vote_df['yob']
2. Balance check
Let’s now check the balance of the various characteristics between the \(T=1\) ('Neighbors') and \(T=0\) ('Civic Duty') groups.
For the continuous variables age and hh_size, let’s compute the mean for each group:
Tip
You may find the pandas.DataFrame.groupby() method from worksheet 2 useful here, or you can compute the means by slicing the dataframe using .loc.
# TODO your code for continuous variables
vote_df.groupby(by='treatment')[['age', 'hh_size']].mean()
treatment | age | hh_size |
---|---|---|
Civic Duty | 49.659035 | 2.189126 |
Neighbors | 49.852936 | 2.187770 |
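The tip above also mentions slicing with .loc as an alternative to groupby(). A minimal sketch of that route, on made-up numbers (toy_df, covariates, and balance are illustrative names), also puts the two sets of means side by side with their difference:

```python
import pandas as pd

# Made-up rows for illustration only; this is not the study data.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty", "Civic Duty", "Neighbors", "Neighbors"],
    "age": [40, 60, 45, 57],
    "hh_size": [2, 2, 1, 3],
})

covariates = ["age", "hh_size"]

# Boolean mask for the T=1 group; ~mask is the T=0 group
# (this assumes the data contain only the two arms).
is_neighbors = toy_df["treatment"] == "Neighbors"
mean_t1 = toy_df.loc[is_neighbors, covariates].mean()
mean_t0 = toy_df.loc[~is_neighbors, covariates].mean()

# One row per covariate, one column per group, plus the gap.
balance = pd.DataFrame({"Neighbors": mean_t1, "Civic Duty": mean_t0})
balance["difference"] = balance["Neighbors"] - balance["Civic Duty"]
print(balance)
```

Reading the difference column directly makes it easier to judge how close the groups are than comparing two long decimals by eye.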
For each of the categorical variables ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004'], let’s compute the proportion of respondents who have the characteristic for each treatment group:
Tip
You may want to use the value_counts() method with normalize=True along with .groupby() to compute the proportions for each individual characteristic.
# TODO your code for categorical variable comparison
categorical_vars = ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']
for col in categorical_vars:
    display(vote_df.groupby(by='treatment')[col].value_counts(normalize=True))
treatment sex
Civic Duty female 0.500183
male 0.499817
Neighbors female 0.500065
male 0.499935
Name: proportion, dtype: float64
treatment g2000
Civic Duty yes 0.841724
no 0.158276
Neighbors yes 0.841653
no 0.158347
Name: proportion, dtype: float64
treatment g2002
Civic Duty yes 0.81111
no 0.18889
Neighbors yes 0.81134
no 0.18866
Name: proportion, dtype: float64
treatment g2004
Civic Duty yes 1.0
Neighbors yes 1.0
Name: proportion, dtype: float64
treatment p2000
Civic Duty no 0.746428
yes 0.253572
Neighbors no 0.748802
yes 0.251198
Name: proportion, dtype: float64
treatment p2002
Civic Duty no 0.611152
yes 0.388848
Neighbors no 0.613413
yes 0.386587
Name: proportion, dtype: float64
treatment p2004
Civic Duty No 0.600555
Yes 0.399445
Neighbors No 0.593335
Yes 0.406665
Name: proportion, dtype: float64
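The per-column outputs above can also be condensed into a single table of "share who voted" per group. The sketch below is one possible way to do that, on made-up rows (toy_df, history_cols, and prop_yes are illustrative names). It lowercases the values first because, as the output above shows, p2004 is coded 'Yes'/'No' while the other voting-history columns use 'yes'/'no'.

```python
import pandas as pd

# Made-up rows for illustration only; this is not the study data.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty", "Civic Duty", "Neighbors", "Neighbors"],
    "g2002": ["yes", "no", "yes", "no"],
    "p2004": ["Yes", "No", "No", "No"],  # capitalized, like p2004 above
})

history_cols = ["g2002", "p2004"]

# Convert each column to True/False, lowercasing first so that
# 'Yes' and 'yes' are treated as the same category.
is_yes = toy_df[history_cols].apply(lambda s: s.str.lower() == "yes")

# The mean of a boolean column is the proportion of True values.
prop_yes = is_yes.groupby(toy_df["treatment"]).mean()
print(prop_yes)
```

This relies on each column being binary, so the 'no' share is just one minus the value shown; a column like sex would need its own reference category.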
Are there any differences between the \(T=1\) and \(T=0\) groups in terms of these characteristics, or do they seem comparable?
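Your response: All characteristics seem comparable between the two groups. Mean age differs by about 0.2 years, mean household size is nearly identical, and every proportion differs by less than one percentage point (the largest gap, for p2004, is about 0.7 percentage points).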
3. Causal effect estimation
Finally, let’s estimate the ATE for the voted outcome. From what we looked at in class, because this is a randomized experiment, we can estimate the ATE by computing the difference in means between the \(T=1\) and \(T=0\) groups for the voted outcome. For this particular dataset, that means subtracting the proportion of Civic Duty respondents who voted from the proportion of Neighbors respondents who voted.
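Written as a formula, the difference-in-means estimate described above can be sketched as follows. The notation is introduced here for illustration and may differ from the notation used in class: \(Y_i\) is 1 if respondent \(i\) voted in the 2006 primary and 0 otherwise, \(T_i\) is 1 for Neighbors and 0 for Civic Duty, and \(n_1\) and \(n_0\) are the number of respondents in each group.

\[
\widehat{ATE} = \frac{1}{n_1} \sum_{i:\, T_i = 1} Y_i \;-\; \frac{1}{n_0} \sum_{i:\, T_i = 0} Y_i
\]

Because \(Y_i\) is binary, each average is simply the proportion of that group who voted.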
Tip
Try computing the two proportions separately and then subtracting them.
We can compute the proportion of respondents who voted ‘Yes’ in each treatment group by using the .value_counts() method with normalize=True and indexing on the 'Yes' category.
# TODO your code for estimating the ATE
display(vote_df.groupby(by='treatment')['voted'].value_counts(normalize=True))
print(f"ATE solution from class: {.377948 - .314538}")
print("----------------")
# Can also be extracted via indexing on the 'Yes' category
EY_neighbors = vote_df.loc[vote_df['treatment'] == 'Neighbors']['voted'].value_counts(normalize=True)['Yes']
EY_civic = vote_df.loc[vote_df['treatment'] == 'Civic Duty']['voted'].value_counts(normalize=True)['Yes']
ATE = EY_neighbors - EY_civic
print(f"Another way to compute ATE via .loc indexing: {ATE}")
treatment voted
Civic Duty No 0.685462
Yes 0.314538
Neighbors No 0.622052
Yes 0.377948
Name: proportion, dtype: float64
ATE solution from class: 0.06341000000000002
----------------
Another way to compute ATE via .loc indexing: 0.06341056883566026
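A third route, sketched below on made-up rows, is to recode voted as 0/1 and take group means directly, which avoids typing the proportions in by hand. The names toy_df, voted_01, turnout, and ate_hat are illustrative and not part of the activity.

```python
import pandas as pd

# Made-up rows for illustration only; this is not the study data.
toy_df = pd.DataFrame({
    "treatment": ["Civic Duty"] * 4 + ["Neighbors"] * 4,
    "voted": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

# Recode the outcome as 1 (voted) / 0 (did not vote).
voted_01 = (toy_df["voted"] == "Yes").astype(int)

# The mean of a 0/1 outcome within a group is that group's turnout rate.
turnout = voted_01.groupby(toy_df["treatment"]).mean()

# Difference in means: T=1 (Neighbors) minus T=0 (Civic Duty).
ate_hat = turnout["Neighbors"] - turnout["Civic Duty"]
print(f"Estimated ATE: {ate_hat:.2f}")
```

On these toy rows the turnout rates are 0.50 and 0.25, so the estimate is 0.25; on the real data the same steps should reproduce the difference computed above.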
Write down an interpretation of this estimate as a causal quantity.
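Your response: Receiving the Neighbors mailing, which applies social pressure, increases the probability of voting in the 2006 primary by about 6.3 percentage points on average, compared with receiving the Civic Duty mailing, which reminds recipients that voting is a civic duty.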
References
Gerber, Alan S., Donald P. Green, and Christopher W. Larimer, 2008, Replication Materials for “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” http://hdl.handle.net/10079/c7507a0d-097a-4689-873a-7424564dfc82. ISPS Data Archive.
Full article: https://isps.yale.edu/sites/default/files/publication/2012/12/ISPS08-001.pdf