# Activity 4: Randomized Experiment Balance and Estimation in Gerber et al. 2008
2025-09-15
# imports we'll need
import numpy as np
import pandas as pd
First, let’s load in the data at `~/COMSC-341CD/data/gerber2008.csv` and take a look at the first few rows via `DataFrame.head()`.
# TODO uncomment to load data
vote_df = pd.read_csv("~/COMSC-341CD/data/gerber2008.csv")
# TODO examine the top of the dataframe
vote_df.head()
|   | sex | yob | treatment | voted | hh_size | g2000 | g2002 | g2004 | p2000 | p2002 | p2004 |
|---|-----|-----|-----------|-------|---------|-------|-------|-------|-------|-------|-------|
| 0 | male | 1941 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No |
| 1 | female | 1947 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No |
| 2 | male | 1951 | Hawthorne | Yes | 3 | yes | yes | yes | no | yes | No |
| 3 | female | 1950 | Hawthorne | Yes | 3 | yes | yes | yes | no | yes | No |
| 4 | female | 1982 | Hawthorne | Yes | 3 | yes | yes | yes | no | yes | No |
# get all of the unique values for column treatment
vote_df['treatment'].unique()
array(['Civic Duty', 'Hawthorne', 'Control', 'Self', 'Neighbors'],
dtype=object)
# TODO giving a summary of the dataframe
vote_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344084 entries, 0 to 344083
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 344084 non-null object
1 yob 344084 non-null int64
2 treatment 344084 non-null object
3 voted 344084 non-null object
4 hh_size 344084 non-null int64
5 g2000 344084 non-null object
6 g2002 344084 non-null object
7 g2004 344084 non-null object
8 p2000 344084 non-null object
9 p2002 344084 non-null object
10 p2004 344084 non-null object
dtypes: int64(2), object(9)
memory usage: 28.9+ MB
# TODO get the column names
vote_df.columns
Index(['sex', 'yob', 'treatment', 'voted', 'hh_size', 'g2000', 'g2002',
'g2004', 'p2000', 'p2002', 'p2004'],
dtype='object')
# TODO get the number of rows and columns
vote_df.shape
(344084, 11)
# TODO return the data types of each column
vote_df.dtypes
sex object
yob int64
treatment object
voted object
hh_size int64
g2000 object
g2002 object
g2004 object
p2000 object
p2002 object
p2004 object
dtype: object
The columns we have are:

- `sex`: the respondent’s sex
- `yob`: the respondent’s year of birth
- `treatment`: the treatment the respondent received
- `voted`: our outcome, whether the respondent voted in the 2006 primary election
- `hh_size`: the size of the respondent’s household
- `g2000`, `g2002`, `g2004`: whether the respondent voted in the 2000, 2002, and 2004 general elections
- `p2000`, `p2002`, `p2004`: whether the respondent voted in the 2000, 2002, and 2004 primary elections
## 1. Data cleaning
The full study has 4 treatment arms and a control group, but today we’re only interested in the `Civic Duty` and `Neighbors` arms. These correspond to the \(T=0\) and \(T=1\) interventions we discussed earlier.

Let’s clean the data to only include these two arms by filtering the `treatment` column.
**Note**
You may find the `pandas.Series.isin()` method useful here in combination with the logical indexing with `.loc` we saw on worksheet 2.
You may encounter a data issue with the `treatment` column – try to identify what it is and we’ll discuss how to fix it together.
# example of logical indexing with .loc: select all respondents born before 1950
vote_df.loc[vote_df['yob'] < 1950]
|   | sex | yob | treatment | voted | hh_size | g2000 | g2002 | g2004 | p2000 | p2002 | p2004 |
|---|-----|-----|-----------|-------|---------|-------|-------|-------|-------|-------|-------|
| 0 | male | 1941 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No |
| 1 | female | 1947 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No |
| 10 | male | 1941 | Control | Yes | 1 | yes | yes | yes | no | no | Yes |
| 11 | male | 1945 | Hawthorne | No | 2 | yes | yes | yes | no | no | No |
| 12 | female | 1949 | Hawthorne | Yes | 2 | yes | yes | yes | no | yes | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344077 | male | 1942 | Neighbors | No | 2 | yes | yes | yes | no | yes | No |
| 344078 | female | 1944 | Control | Yes | 2 | yes | yes | yes | no | no | No |
| 344079 | male | 1943 | Control | Yes | 2 | yes | yes | yes | no | yes | Yes |
| 344082 | male | 1937 | Control | Yes | 2 | yes | yes | yes | yes | yes | Yes |
| 344083 | female | 1949 | Control | Yes | 2 | yes | yes | yes | no | no | Yes |

106259 rows × 11 columns
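The note above mentions a possible data issue in the `treatment` column. One quick, general way to look for label problems such as stray whitespace or inconsistent capitalization (hypothetical examples – the exact issue in your copy of the data may differ) is to print the raw labels with `repr()` so that any surrounding spaces become visible:

# count how often each raw treatment label occurs
print(vote_df['treatment'].value_counts())
# repr() puts quotes around each label, making stray whitespace easy to spot
print([repr(label) for label in vote_df['treatment'].unique()])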
# TODO your code here selecting the 'Civic Duty' and 'Neighbors' rows
# .copy() avoids pandas' SettingWithCopyWarning when we add a new column below
vote_df = vote_df.loc[vote_df['treatment'].isin(['Civic Duty', 'Neighbors'])].copy()
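As an optional sanity check (not part of the original prompt), we can confirm that only the two arms of interest remain and see how many rows are left:

# the filtered dataframe should now contain only the two arms of interest
print(vote_df['treatment'].unique())
print(vote_df.shape)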
Let’s generate an additional column that corresponds to the respondent’s age in 2006.
# TODO generate the 'age' column
vote_df['age'] = 2006 - vote_df['yob']
vote_df.head()
|   | sex | yob | treatment | voted | hh_size | g2000 | g2002 | g2004 | p2000 | p2002 | p2004 | age |
|---|-----|-----|-----------|-------|---------|-------|-------|-------|-------|-------|-------|-----|
| 0 | male | 1941 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No | 65 |
| 1 | female | 1947 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No | 59 |
| 19 | male | 1939 | Neighbors | Yes | 1 | yes | yes | yes | no | yes | No | 67 |
| 27 | male | 1965 | Civic Duty | No | 2 | yes | yes | yes | yes | yes | No | 41 |
| 28 | female | 1965 | Civic Duty | No | 2 | yes | yes | yes | no | yes | No | 41 |
## 2. Balance check
One way we can check whether exchangeability is plausible is to check whether the distributions of observed characteristics are the same between the treatment and control groups.

Let’s now check the balance of the various characteristics between the \(T=1\) (`'Neighbors'`) and \(T=0\) (`'Civic Duty'`) groups.

For the continuous variables `age` and `hh_size`, let’s compute the mean for each group:
**Tip**
You may find the `pandas.DataFrame.groupby()` method from worksheet 2 useful here, or you can compute the means by slicing the dataframe using `.loc`.
# overall mean age across both groups, for reference
vote_df['age'].mean()

# approach 1: slice each group with a boolean mask, then take the mean
vote_df.loc[vote_df['treatment'] == 'Neighbors', 'age'].mean()
vote_df.loc[vote_df['treatment'] == 'Civic Duty', 'age'].mean()

# approach 2: group by treatment and take the mean of each group
vote_df.groupby('treatment')['age'].mean()
# TODO your code for continuous variables
print(vote_df.groupby('treatment')['age'].mean())
print(vote_df.groupby('treatment')['hh_size'].mean())
treatment
Civic Duty 49.659035
Neighbors 49.852936
Name: age, dtype: float64
treatment
Civic Duty 2.189126
Neighbors 2.187770
Name: hh_size, dtype: float64
For each of the categorical variables `['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']`, let’s compute the proportion of respondents who have each characteristic within each `treatment` group:

**Tip**
You may want to use the `value_counts()` method with `normalize=True` along with `.groupby()` to compute the proportions for each individual characteristic.
# TODO, group or select by T=1 and T=0
vote_df['g2000'].value_counts(normalize=True)
g2000
yes 0.841689
no 0.158311
Name: proportion, dtype: float64
# TODO your code for categorical variable comparison
categorical_vars = ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']
# we might need a for loop here
for category in categorical_vars:
print(vote_df.groupby('treatment')[category].value_counts(normalize=True))
treatment sex
Civic Duty female 0.500183
male 0.499817
Neighbors female 0.500065
male 0.499935
Name: proportion, dtype: float64
treatment g2000
Civic Duty yes 0.841724
no 0.158276
Neighbors yes 0.841653
no 0.158347
Name: proportion, dtype: float64
treatment g2002
Civic Duty yes 0.81111
no 0.18889
Neighbors yes 0.81134
no 0.18866
Name: proportion, dtype: float64
treatment g2004
Civic Duty yes 1.0
Neighbors yes 1.0
Name: proportion, dtype: float64
treatment p2000
Civic Duty no 0.746428
yes 0.253572
Neighbors no 0.748802
yes 0.251198
Name: proportion, dtype: float64
treatment p2002
Civic Duty no 0.611152
yes 0.388848
Neighbors no 0.613413
yes 0.386587
Name: proportion, dtype: float64
treatment p2004
Civic Duty No 0.600555
Yes 0.399445
Neighbors No 0.593335
Yes 0.406665
Name: proportion, dtype: float64
Are there any differences between the \(T=1\) and \(T=0\) groups in terms of these characteristics, or do they seem comparable?
Your response: All characteristics appear comparable: the group means of `age` and `hh_size` and the proportions for each categorical variable are nearly identical between the `Neighbors` and `Civic Duty` groups, which is what we would expect from randomization.
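To back up this judgment with a single number per characteristic, one optional approach is to compute the absolute difference between the two groups for each covariate – a minimal sketch building on the cleaned `vote_df` from above:

# absolute difference in group means for the continuous covariates
for covariate in ['age', 'hh_size']:
    means = vote_df.groupby('treatment')[covariate].mean()
    print(f"{covariate}: {abs(means['Neighbors'] - means['Civic Duty']):.4f}")

# largest absolute difference in category proportions for each categorical covariate
for covariate in ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']:
    props = vote_df.groupby('treatment')[covariate].value_counts(normalize=True).unstack()
    print(f"{covariate}: {(props.loc['Neighbors'] - props.loc['Civic Duty']).abs().max():.4f}")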
## 3. Causal effect estimation
Finally, let’s estimate the ATE for the `voted` outcome. From what we looked at in class, because this is a randomized experiment we can estimate the ATE by computing the difference in means between the \(T=1\) and \(T=0\) groups for the `voted` outcome. For this particular dataset, that means:

\[
\widehat{\text{ATE}} = \Pr(\text{voted} = \text{Yes} \mid T = 1) - \Pr(\text{voted} = \text{Yes} \mid T = 0),
\]

where \(T=1\) is the `Neighbors` group and \(T=0\) is the `Civic Duty` group.
**Tip**
Try computing the two proportions separately and then subtracting them. We can compute the proportion of respondents who voted ‘Yes’ in each treatment group by using the `.value_counts()` method with `normalize=True` and indexing on the `'Yes'` category.
vote_df.loc[vote_df['treatment'] == 'Neighbors']['voted'].value_counts(normalize=True)
voted
No 0.622052
Yes 0.377948
Name: proportion, dtype: float64
vote_df.loc[vote_df['treatment'] == 'Civic Duty']['voted'].value_counts(normalize=True)
voted
No 0.685462
Yes 0.314538
Name: proportion, dtype: float64
# TODO your code for estimating the ATE
# P(voted = Yes | T = 1) - P(voted = Yes | T = 0), using the proportions computed above
estimated_ATE = 0.377948 - 0.314538
estimated_ATE
0.06341000000000002
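As a sanity check, we can also compute the same estimate directly from the dataframe instead of copying the rounded proportions by hand – a minimal sketch using the cleaned `vote_df` from above:

# proportion voting 'Yes' in each arm, computed directly from the data
p_treated = (vote_df.loc[vote_df['treatment'] == 'Neighbors', 'voted'] == 'Yes').mean()
p_control = (vote_df.loc[vote_df['treatment'] == 'Civic Duty', 'voted'] == 'Yes').mean()
p_treated - p_control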
Write down an interpretation of this estimate as a causal quantity.
Your response: Applying social pressure through mail (the Neighbors treatment) causally increases voter turnout by about 6.3 percentage points on average, compared to mail that simply reminds recipients that voting is a civic duty.
## References
Gerber, Alan S., Donald P. Green, and Christopher W. Larimer, 2008, Replication Materials for “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” http://hdl.handle.net/10079/c7507a0d-097a-4689-873a-7424564dfc82. ISPS Data Archive.
Full article: https://isps.yale.edu/sites/default/files/publication/2012/12/ISPS08-001.pdf