Activity 4: Randomized Experiment Balance and Estimation in Gerber et al. 2008#

2025-09-15


# imports we'll need
import numpy as np
import pandas as pd

First, let’s load in the data at ~/COMSC-341CD/data/gerber2008.csv and take a look at the first few rows via DataFrame.head().

# TODO load the data
vote_df = pd.read_csv("~/COMSC-341CD/data/gerber2008.csv")
# TODO examine the first few rows of the dataframe
vote_df.head()
sex yob treatment voted hh_size g2000 g2002 g2004 p2000 p2002 p2004
0 male 1941 Civic Duty No 2 yes yes yes no yes No
1 female 1947 Civic Duty No 2 yes yes yes no yes No
2 male 1951 Hawthorne Yes 3 yes yes yes no yes No
3 female 1950 Hawthorne Yes 3 yes yes yes no yes No
4 female 1982 Hawthorne Yes 3 yes yes yes no yes No
# get all of the unique values for column treatment
vote_df['treatment'].unique()
array(['Civic Duty', 'Hawthorne', 'Control', 'Self', 'Neighbors'],
      dtype=object)
# TODO giving a summary of the dataframe
vote_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344084 entries, 0 to 344083
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   sex        344084 non-null  object
 1   yob        344084 non-null  int64 
 2   treatment  344084 non-null  object
 3   voted      344084 non-null  object
 4   hh_size    344084 non-null  int64 
 5   g2000      344084 non-null  object
 6   g2002      344084 non-null  object
 7   g2004      344084 non-null  object
 8   p2000      344084 non-null  object
 9   p2002      344084 non-null  object
 10  p2004      344084 non-null  object
dtypes: int64(2), object(9)
memory usage: 28.9+ MB
# TODO get the column names
vote_df.columns
Index(['sex', 'yob', 'treatment', 'voted', 'hh_size', 'g2000', 'g2002',
       'g2004', 'p2000', 'p2002', 'p2004'],
      dtype='object')
# TODO get the number of rows and columns
vote_df.shape
(344084, 11)
# TODO return the data types of each column
vote_df.dtypes
sex          object
yob           int64
treatment    object
voted        object
hh_size       int64
g2000        object
g2002        object
g2004        object
p2000        object
p2002        object
p2004        object
dtype: object

The columns we have are:

  • sex: the respondent’s sex

  • yob: the respondent’s year of birth

  • treatment: the treatment the respondent received

  • voted: our outcome: whether the respondent voted in the 2006 primary election

  • hh_size: the size of the respondent’s household

  • g2000, g2002, g2004: whether the respondent voted in the 2000, 2002, and 2004 general election

  • p2000, p2002, p2004: whether the respondent voted in the 2000, 2002, and 2004 primary election

1. Data cleaning#

The full study has 4 treatment arms and a control group, but today we’re only interested in the Civic Duty and Neighbors arms. These correspond to the \(T=0\) and \(T=1\) interventions we discussed earlier:

\[\begin{split} T = \begin{cases} 1 & \text{mail stating that their household turnout would be publicized to their neighbors} \\ 0 & \text{mail reminding them that voting is a civic duty} \end{cases} \end{split}\]

Let’s clean the data to only include these two arms by filtering the treatment column.

Note

You may find the pandas.Series.isin() method useful here in combination with the logical indexing with .loc we saw on worksheet 2.

You may encounter a data issue with the treatment column – try to identify what it is and we’ll discuss how to fix it together.
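One common kind of data issue in categorical columns is inconsistent labels, e.g. stray whitespace or mixed capitalization, which makes `.isin()` silently match nothing. The specific issue in gerber2008.csv may differ; this is a minimal sketch on made-up data of how to spot and repair stray whitespace:

```python
import pandas as pd

# hypothetical example: labels with stray leading whitespace
df = pd.DataFrame({'treatment': [' Civic Duty', 'Neighbors', ' Civic Duty']})

# repr() makes invisible whitespace visible when inspecting unique values
print([repr(v) for v in df['treatment'].unique()])  # ["' Civic Duty'", "'Neighbors'"]

# strip surrounding whitespace so .isin() matches as expected
df['treatment'] = df['treatment'].str.strip()
print(df['treatment'].isin(['Civic Duty', 'Neighbors']).all())  # True
```

Printing `repr()` of each unique value is a quick way to make whitespace issues visible before deciding on a fix.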

vote_df.loc[vote_df['yob'] < 1950]
sex yob treatment voted hh_size g2000 g2002 g2004 p2000 p2002 p2004
0 male 1941 Civic Duty No 2 yes yes yes no yes No
1 female 1947 Civic Duty No 2 yes yes yes no yes No
10 male 1941 Control Yes 1 yes yes yes no no Yes
11 male 1945 Hawthorne No 2 yes yes yes no no No
12 female 1949 Hawthorne Yes 2 yes yes yes no yes No
... ... ... ... ... ... ... ... ... ... ... ...
344077 male 1942 Neighbors No 2 yes yes yes no yes No
344078 female 1944 Control Yes 2 yes yes yes no no No
344079 male 1943 Control Yes 2 yes yes yes no yes Yes
344082 male 1937 Control Yes 2 yes yes yes yes yes Yes
344083 female 1949 Control Yes 2 yes yes yes no no Yes

106259 rows × 11 columns

# TODO your code here selecting the 'Civic Duty' and 'Neighbors' rows
# .copy() avoids a SettingWithCopyWarning when we add columns to the slice later
vote_df = vote_df.loc[vote_df['treatment'].isin(['Civic Duty', 'Neighbors'])].copy()
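Throughout the activity, \(T=1\) corresponds to 'Neighbors' and \(T=0\) to 'Civic Duty'. If you prefer working with a numeric indicator, one optional sketch (the column name 'T' is my own choice, not part of the activity; the toy frame stands in for vote_df):

```python
import pandas as pd

# toy stand-in for the filtered vote_df
toy = pd.DataFrame({'treatment': ['Neighbors', 'Civic Duty', 'Neighbors']})

# encode the treatment indicator: 1 for 'Neighbors', 0 for 'Civic Duty'
toy['T'] = (toy['treatment'] == 'Neighbors').astype(int)
print(toy['T'].tolist())  # [1, 0, 1]
```

This makes later group comparisons read directly as \(T=1\) vs. \(T=0\), at the cost of an extra column.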

Let’s generate an additional column that corresponds to the respondent’s age in 2006.

# TODO generate the 'age' column
vote_df['age'] = 2006 - vote_df['yob']
vote_df.head()
sex yob treatment voted hh_size g2000 g2002 g2004 p2000 p2002 p2004 age
0 male 1941 Civic Duty No 2 yes yes yes no yes No 65
1 female 1947 Civic Duty No 2 yes yes yes no yes No 59
19 male 1939 Neighbors Yes 1 yes yes yes no yes No 67
27 male 1965 Civic Duty No 2 yes yes yes yes yes No 41
28 female 1965 Civic Duty No 2 yes yes yes no yes No 41

2. Balance check#

One way we can check whether exchangeability is plausible is to check whether the distributions of observed characteristics are the same between the treatment and control groups.

Let’s now check the balance of the various characteristics between the \(T=1\) ('Neighbors') and \(T=0\) ('Civic Duty') groups.

For the continuous variables of age and hh_size, let’s compute the mean for each group:

Tip

You may find the pandas.DataFrame.groupby() method from worksheet 2 useful here, or you can compute the means by slicing the dataframe using .loc.

vote_df['age'].mean()

# approach 1: slice each group with .loc, then take the mean
vote_df.loc[vote_df['treatment'] == 'Neighbors', 'age'].mean()
vote_df.loc[vote_df['treatment'] == 'Civic Duty', 'age'].mean()

# approach 2: group by treatment and take the mean
vote_df.groupby('treatment')['age'].mean()
# TODO your code for continuous variables

print(vote_df.groupby('treatment')['age'].mean())
print(vote_df.groupby('treatment')['hh_size'].mean())
treatment
Civic Duty    49.659035
Neighbors     49.852936
Name: age, dtype: float64
treatment
Civic Duty    2.189126
Neighbors     2.187770
Name: hh_size, dtype: float64

For each of the categorical variables of ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004'], let’s compute the proportion of respondents who have the characteristic for each treatment group:

Tip

You may want to use the value_counts() method with normalize=True along with .groupby() to compute the proportions for each individual characteristic.

# TODO, group or select by T=1 and T=0
vote_df['g2000'].value_counts(normalize=True)
g2000
yes    0.841689
no     0.158311
Name: proportion, dtype: float64
# TODO your code for categorical variable comparison
categorical_vars = ['sex', 'g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']

# we might need a for loop here
for category in categorical_vars:
    print(vote_df.groupby('treatment')[category].value_counts(normalize=True))
treatment   sex   
Civic Duty  female    0.500183
            male      0.499817
Neighbors   female    0.500065
            male      0.499935
Name: proportion, dtype: float64
treatment   g2000
Civic Duty  yes      0.841724
            no       0.158276
Neighbors   yes      0.841653
            no       0.158347
Name: proportion, dtype: float64
treatment   g2002
Civic Duty  yes      0.81111
            no       0.18889
Neighbors   yes      0.81134
            no       0.18866
Name: proportion, dtype: float64
treatment   g2004
Civic Duty  yes      1.0
Neighbors   yes      1.0
Name: proportion, dtype: float64
treatment   p2000
Civic Duty  no       0.746428
            yes      0.253572
Neighbors   no       0.748802
            yes      0.251198
Name: proportion, dtype: float64
treatment   p2002
Civic Duty  no       0.611152
            yes      0.388848
Neighbors   no       0.613413
            yes      0.386587
Name: proportion, dtype: float64
treatment   p2004
Civic Duty  No       0.600555
            Yes      0.399445
Neighbors   No       0.593335
            Yes      0.406665
Name: proportion, dtype: float64

Are there any differences between the \(T=1\) and \(T=0\) groups in terms of these characteristics, or do they seem comparable?

Your response: The means and proportions are nearly identical across the two groups for every characteristic, so the groups appear well balanced, as we would expect under randomization.
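A compact way to summarize the checks above is to compute, for every covariate at once, the difference in group means (for categorical covariates, the difference in proportions of each level after one-hot encoding). This helper is my own sketch, not part of the activity, shown on toy data; in the activity you would pass vote_df:

```python
import pandas as pd

def mean_differences(df, treat_col, t1, t0, covariates):
    """Difference in covariate means between the t1 and t0 groups.

    Categorical covariates are one-hot encoded first, so each level
    gets its own proportion difference; values near 0 indicate balance.
    """
    dummies = pd.get_dummies(df[covariates])          # numeric columns pass through
    grouped = dummies.groupby(df[treat_col]).mean()   # one row of means per arm
    return grouped.loc[t1] - grouped.loc[t0]

# toy data standing in for vote_df
toy = pd.DataFrame({
    'treatment': ['Neighbors', 'Neighbors', 'Civic Duty', 'Civic Duty'],
    'age':       [50, 48, 49, 51],
    'sex':       ['male', 'female', 'male', 'female'],
})
print(mean_differences(toy, 'treatment', 'Neighbors', 'Civic Duty', ['age', 'sex']))
```

One call then produces the whole balance table instead of a loop of value_counts() printouts, which is convenient when there are many covariates to scan.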

3. Causal effect estimation#

Finally, let’s estimate the ATE for the voted outcome. From what we looked at in class, because this is a randomized experiment, we can estimate the ATE by computing the difference in means between the \(T=1\) and \(T=0\) groups for the voted outcome. For this particular dataset, that means:

\[\begin{split} \begin{align*} \text{ATE} = E[Y(1) - Y(0)] \xrightarrow[]{\text{Identification}} \; &E[Y | T=1] - E[Y | T=0]\\ E[Y | T=1] - E[Y | T=0] \; \xrightarrow[]{\text{Estimation}} \; &\hat{E}[Y | T=1] - \hat{E}[Y | T=0]\\ =&\hat{P}(\text{voted='Yes'} \mid \text{treatment='Neighbors'}) - \hat{P}(\text{voted='Yes'} \mid \text{treatment='Civic Duty'}) \end{align*} \end{split}\]

Tip

Try computing the two proportions separately and then subtracting them.

We can compute the proportion of respondents who voted ‘Yes’ in each treatment group by using the .value_counts() method with normalize=True and indexing on the 'Yes' category.

vote_df.loc[vote_df['treatment'] == 'Neighbors']['voted'].value_counts(normalize=True)
voted
No     0.622052
Yes    0.377948
Name: proportion, dtype: float64
vote_df.loc[vote_df['treatment'] == 'Civic Duty']['voted'].value_counts(normalize=True)
voted
No     0.685462
Yes    0.314538
Name: proportion, dtype: float64
# TODO your code for estimating the ATE
#               vote=yes, T=1    vote=yes, T=0
estimated_ATE = 0.377948      -  0.314538
estimated_ATE
0.06341000000000002
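Rather than copying the printed proportions by hand, the same estimate can be computed directly from the dataframe, which avoids transcription errors. A sketch on toy data (in the activity, vote_df takes the place of the toy frame, and the result would be the 0.0634 computed above):

```python
import pandas as pd

# toy stand-in for vote_df with the same column conventions
toy = pd.DataFrame({
    'treatment': ['Neighbors'] * 4 + ['Civic Duty'] * 4,
    'voted':     ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

# P(voted = 'Yes' | treatment) for each arm, then the difference in means
p_yes = (toy['voted'] == 'Yes').groupby(toy['treatment']).mean()
estimated_ATE = p_yes['Neighbors'] - p_yes['Civic Duty']
print(estimated_ATE)  # 0.5 - 0.25 = 0.25
```

Averaging the boolean indicator `voted == 'Yes'` within each arm gives exactly the proportions that value_counts(normalize=True) reports.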

Write down an interpretation of this estimate as a causal quantity.

Your response: Applying social pressure through the mail causally increases voter turnout on average by about 6.3 percentage points (from roughly 31.5% to 37.8%) relative to mail reminding recipients that voting is a civic duty.

References#

Gerber, Alan S., Donald P. Green, and Christopher W. Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” American Political Science Review 102(1): 33–48.