Activity 4: Randomized Experiment Balance and Estimation in Gerber et al. 2008#

2025-09-15


# imports we'll need
import numpy as np
import pandas as pd

First, let’s load in the data at ~/COMSC-341CD/data/gerber2008.csv and take a look at the first few rows via DataFrame.head().

# TODO load the data
vote_df = pd.read_csv("~/COMSC-341CD/data/gerber2008.csv")
# TODO examine the first few rows of the dataframe
vote_df.head()
sex yob treatment voted hh_size g2000 g2002 g2004 p2000 p2002 p2004
0 male 1941 Civic Duty No 2 yes yes yes no yes No
1 female 1947 Civic Duty No 2 yes yes yes no yes No
2 male 1951 Hawthorne Yes 3 yes yes yes no yes No
3 female 1950 Hawthorne Yes 3 yes yes yes no yes No
4 female 1982 Hawthorne Yes 3 yes yes yes no yes No
# get all of the unique values for column treatment
vote_df['treatment'].unique()
array(['Civic Duty', 'Hawthorne', 'Control', 'Self', 'Neighbors'],
      dtype=object)
# TODO giving a summary of the dataframe
vote_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344084 entries, 0 to 344083
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   sex        344084 non-null  object
 1   yob        344084 non-null  int64 
 2   treatment  344084 non-null  object
 3   voted      344084 non-null  object
 4   hh_size    344084 non-null  int64 
 5   g2000      344084 non-null  object
 6   g2002      344084 non-null  object
 7   g2004      344084 non-null  object
 8   p2000      344084 non-null  object
 9   p2002      344084 non-null  object
 10  p2004      344084 non-null  object
dtypes: int64(2), object(9)
memory usage: 28.9+ MB
# TODO get the column names
vote_df.columns
Index(['sex', 'yob', 'treatment', 'voted', 'hh_size', 'g2000', 'g2002',
       'g2004', 'p2000', 'p2002', 'p2004'],
      dtype='object')
# TODO get the number of rows and columns
vote_df.shape
(344084, 11)
# TODO return the data types of each column
vote_df.dtypes
sex          object
yob           int64
treatment    object
voted        object
hh_size       int64
g2000        object
g2002        object
g2004        object
p2000        object
p2002        object
p2004        object
dtype: object

The columns we have are:

  • sex: the respondent’s sex

  • yob: the respondent’s year of birth

  • treatment: the treatment the respondent received

  • voted: our outcome: whether the respondent voted in the 2006 primary election

  • hh_size: the size of the respondent’s household

  • g2000, g2002, g2004: whether the respondent voted in the 2000, 2002, and 2004 general election

  • p2000, p2002, p2004: whether the respondent voted in the 2000, 2002, and 2004 primary election

1. Data cleaning#

The full study has 4 treatment arms and a control group, but today we’re only interested in the Civic Duty and Neighbors arms. These correspond to the \(T=0\) and \(T=1\) interventions we discussed earlier:

\[\begin{split} T = \begin{cases} 1 & \text{mail stating that their household turnout would be publicized to their neighbors} \\ 0 & \text{mail reminding them that voting is a civic duty} \end{cases} \end{split}\]

Let’s clean the data to only include these two arms by filtering the treatment column.

Note

You may find the pandas.Series.isin() method useful here in combination with the logical indexing with .loc we saw on worksheet 2.

You may encounter a data issue with the treatment column – try to identify what it is and we’ll discuss how to fix it together.
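One common kind of data issue in categorical columns is inconsistent labels, e.g. stray whitespace or mixed capitalization, which makes `.isin()` silently match nothing. The specific issue in gerber2008.csv may differ; this is a minimal sketch on made-up data of how to spot and repair stray whitespace:

```python
import pandas as pd

# hypothetical example: labels with stray leading whitespace
df = pd.DataFrame({'treatment': [' Civic Duty', 'Neighbors', ' Civic Duty']})

# repr() makes invisible whitespace visible when inspecting unique values
print([repr(v) for v in df['treatment'].unique()])  # ["' Civic Duty'", "'Neighbors'"]

# strip surrounding whitespace so .isin() matches as expected
df['treatment'] = df['treatment'].str.strip()
print(df['treatment'].isin(['Civic Duty', 'Neighbors']).all())  # True
```

Printing `repr()` of each unique value is a quick way to make whitespace issues visible before deciding on a fix.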

vote_df.loc[vote_df['yob'] < 1950]
sex yob treatment voted hh_size g2000 g2002 g2004 p2000 p2002 p2004
0 male 1941 Civic Duty No 2 yes yes yes no yes No
1 female 1947 Civic Duty No 2 yes yes yes no yes No
10 male 1941 Control Yes 1 yes yes yes no no Yes
11 male 1945 Hawthorne No 2 yes yes yes no no No
12 female 1949 Hawthorne Yes 2 yes yes yes no yes No
... ... ... ... ... ... ... ... ... ... ... ...
344077 male 1942 Neighbors No 2 yes yes yes no yes No
344078 female 1944 Control Yes 2 yes yes yes no no No
344079 male 1943 Control Yes 2 yes yes yes no yes Yes
344082 male 1937 Control Yes 2 yes yes yes yes yes Yes
344083 female 1949 Control Yes 2 yes yes yes no no Yes

106259 rows × 11 columns

# TODO your code here selecting the 'Civic Duty' and 'Neighbors' rows
# .copy() avoids a SettingWithCopyWarning when we add columns to the slice later
vote_df = vote_df.loc[vote_df['treatment'].isin(['Civic Duty', 'Neighbors'])].copy()
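Throughout the activity, \(T=1\) corresponds to 'Neighbors' and \(T=0\) to 'Civic Duty'. If you prefer working with a numeric indicator, one optional sketch (the column name 'T' is my own choice, not part of the activity; the toy frame stands in for vote_df):

```python
import pandas as pd

# toy stand-in for the filtered vote_df
toy = pd.DataFrame({'treatment': ['Neighbors', 'Civic Duty', 'Neighbors']})

# encode the treatment indicator: 1 for 'Neighbors', 0 for 'Civic Duty'
toy['T'] = (toy['treatment'] == 'Neighbors').astype(int)
print(toy['T'].tolist())  # [1, 0, 1]
```

This makes later group comparisons read directly as \(T=1\) vs. \(T=0\), at the cost of an extra column.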

Let’s generate an additional column that corresponds to the respondent’s age in 2006.

# TODO generate the 'age' column
vote_df['age'] = 2006 - vote_df['yob']
vote_df.head()
sex yob treatment voted hh_size g2000 g2002 g2004 p2000 p2002 p2004 age
0 male 1941 Civic Duty No 2 yes yes yes no yes No 65
1 female 1947 Civic Duty No 2 yes yes yes no yes No 59
19 male 1939 Neighbors Yes 1 yes yes yes no yes No 67
27 male 1965 Civic Duty No 2 yes yes yes yes yes No 41
28 female 1965 Civic Duty No 2 yes yes yes no yes No 41

2. Balance check#

One way we can check whether exchangeability is plausible is to check whether the distributions of observed characteristics are the same between the treatment and control groups.

Let’s now check the balance of the various characteristics between the \(T=1\) ('Neighbors') and \(T=0\) ('Civic Duty') groups.

For the continuous variables of age and hh_size, let’s compute the mean for each group:

Tip

You may find the pandas.DataFrame.groupby() method from worksheet 2 useful here, or you can compute the means by slicing the dataframe using .loc.

vote_df['age'].mean()

# approach 1: slice each group with .loc, then take the mean
vote_df.loc[vote_df['treatment'] == 'Neighbors', 'age'].mean()
vote_df.loc[vote_df['treatment'] == 'Civic Duty', 'age'].mean()

# approach 2: group by treatment and take the mean
vote_df.groupby('treatment')['age'].mean()
# TODO your code for continuous variables

print(vote_df.groupby('treatment')['age'].mean())
print(vote_df.groupby('treatment')['hh_size'].mean())
treatment
Civic Duty    49.659035
Neighbors     49.852936
Name: age, dtype: float64
treatment
Civic Duty    2.189126
Neighbors     2.187770
Name: hh_size, dtype: float64

For each of the categorical variables of ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004'], let’s compute the proportion of respondents who have the characteristic for each treatment group:

Tip

You may want to use the value_counts() method with normalize=True along with .groupby() to compute the proportions for each individual characteristic.

# TODO, group or select by T=1 and T=0
vote_df['g2000'].value_counts(normalize=True)
g2000
yes    0.841689
no     0.158311
Name: proportion, dtype: float64
# TODO your code for categorical variable comparison
categorical_vars = ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']

# we might need a for loop here
for category in categorical_vars:
    print(vote_df.groupby('treatment')[category].value_counts(normalize=True))
treatment   sex   
Civic Duty  female    0.500183
            male      0.499817
Neighbors   female    0.500065
            male      0.499935
Name: proportion, dtype: float64
treatment   g2000
Civic Duty  yes      0.841724
            no       0.158276
Neighbors   yes      0.841653
            no       0.158347
Name: proportion, dtype: float64
treatment   g2002
Civic Duty  yes      0.81111
            no       0.18889
Neighbors   yes      0.81134
            no       0.18866
Name: proportion, dtype: float64
treatment   g2004
Civic Duty  yes      1.0
Neighbors   yes      1.0
Name: proportion, dtype: float64
treatment   p2000
Civic Duty  no       0.746428
            yes      0.253572
Neighbors   no       0.748802
            yes      0.251198
Name: proportion, dtype: float64
treatment   p2002
Civic Duty  no       0.611152
            yes      0.388848
Neighbors   no       0.613413
            yes      0.386587
Name: proportion, dtype: float64
treatment   p2004
Civic Duty  No       0.600555
            Yes      0.399445
Neighbors   No       0.593335
            Yes      0.406665
Name: proportion, dtype: float64

Are there any differences between the \(T=1\) and \(T=0\) groups in terms of these characteristics, or do they seem comparable?

Your response: The means and proportions are nearly identical across the two groups for every characteristic, so the groups appear well balanced, as we would expect under randomization.
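A compact way to summarize the checks above is to compute, for every covariate at once, the difference in group means (for categorical covariates, the difference in proportions of each level after one-hot encoding). This helper is my own sketch, not part of the activity, shown on toy data; in the activity you would pass vote_df:

```python
import pandas as pd

def mean_differences(df, treat_col, t1, t0, covariates):
    """Difference in covariate means between the t1 and t0 groups.

    Categorical covariates are one-hot encoded first, so each level
    gets its own proportion difference; values near 0 indicate balance.
    """
    dummies = pd.get_dummies(df[covariates])          # numeric columns pass through
    grouped = dummies.groupby(df[treat_col]).mean()   # one row of means per arm
    return grouped.loc[t1] - grouped.loc[t0]

# toy data standing in for vote_df
toy = pd.DataFrame({
    'treatment': ['Neighbors', 'Neighbors', 'Civic Duty', 'Civic Duty'],
    'age':       [50, 48, 49, 51],
    'sex':       ['male', 'female', 'male', 'female'],
})
print(mean_differences(toy, 'treatment', 'Neighbors', 'Civic Duty', ['age', 'sex']))
```

One call then produces the whole balance table instead of a loop of value_counts() printouts, which is convenient when there are many covariates to scan.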

3. Causal effect estimation#

Finally, let’s estimate the ATE for the voted outcome. From what we looked at in class, because this is a randomized experiment, we can estimate the ATE by computing the difference in means between the \(T=1\) and \(T=0\) groups for the voted outcome. For this particular dataset, that means:

\[\begin{split} \begin{align*} \text{ATE} = E[Y(1) - Y(0)] \xrightarrow[]{\text{Identification}} \; &E[Y | T=1] - E[Y | T=0]\\ E[Y | T=1] - E[Y | T=0] \; \xrightarrow[]{\text{Estimation}} \; &\hat{E}[Y | T=1] - \hat{E}[Y | T=0]\\ =&\hat{P}(\text{voted='Yes'} \mid \text{treatment='Neighbors'}) - \hat{P}(\text{voted='Yes'} \mid \text{treatment='Civic Duty'}) \end{align*} \end{split}\]

Tip

Try computing the two proportions separately and then subtracting them.

We can compute the proportion of respondents who voted ‘Yes’ in each treatment group by using the .value_counts() method with normalize=True and indexing on the 'Yes' category.

vote_df.loc[vote_df['treatment'] == 'Neighbors']['voted'].value_counts(normalize=True)
voted
No     0.622052
Yes    0.377948
Name: proportion, dtype: float64
vote_df.loc[vote_df['treatment'] == 'Civic Duty']['voted'].value_counts(normalize=True)
voted
No     0.685462
Yes    0.314538
Name: proportion, dtype: float64
# TODO your code for estimating the ATE
#               vote=yes, T=1    vote=yes, T=0
estimated_ATE = 0.377948      -  0.314538
estimated_ATE
0.06341000000000002
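Rather than copying the printed proportions by hand, the same estimate can be computed directly from the dataframe, which avoids transcription errors. A sketch on toy data (in the activity, vote_df takes the place of the toy frame, and the result would be the 0.0634 computed above):

```python
import pandas as pd

# toy stand-in for vote_df with the same column conventions
toy = pd.DataFrame({
    'treatment': ['Neighbors'] * 4 + ['Civic Duty'] * 4,
    'voted':     ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

# P(voted = 'Yes' | treatment) for each arm, then the difference in means
p_yes = (toy['voted'] == 'Yes').groupby(toy['treatment']).mean()
estimated_ATE = p_yes['Neighbors'] - p_yes['Civic Duty']
print(estimated_ATE)  # 0.5 - 0.25 = 0.25
```

Averaging the boolean indicator `voted == 'Yes'` within each arm gives exactly the proportions that value_counts(normalize=True) reports.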

Write down an interpretation of this estimate as a causal quantity.

Your response: Applying social pressure through the mail causally increases voter turnout on average by about 6.3 percentage points (from roughly 31.5% to 37.8%) relative to mail reminding recipients that voting is a civic duty.

References#

Gerber, Alan S., Donald P. Green, and Christopher W. Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” American Political Science Review 102(1): 33–48.