Worksheet 6: linearmodels with Large Language Models#
TODO your name here
Collaboration Statement
TODO brief statement on the nature of your collaboration.
TODO your collaborators' names here
Learning Objectives#
- Gain experience using the linearmodels package to perform quasi-experiment estimation
- Explore how LLMs can be integrated into your workflow for data analysis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
linearmodels (LMs)#
Throughout the semester, we have been building estimators for different causal study designs from the ground up, translating mathematical formulas into code using numpy, pandas, and statsmodels. The reason we've taken this approach is to build a strong conceptual foundation in how the different estimators work.
In practice, many packages in both R and Python implement causal inference methods for us. Most applied researchers, policymakers, and scientists use these off-the-shelf packages to perform estimation. We'll take a look at one such package in Python, linearmodels, to help us perform estimation in the fuzzy regression discontinuity and difference-in-differences designs.
Note
A key advantage that R has over Python is that cutting-edge causal inference methods tend to be implemented in R first, though Python support is now catching up. As a data scientist, it is useful to be familiar with both languages and computing environments to be able to use the best tool for the job.
linearmodels is modeled to resemble the syntax and structure of statsmodels, but with some extensions for direct support of panel data and instrumental variable models. See the user guides below for more details on how to use the linearmodels package:
Large Language Models (LLMs)#
In this worksheet, we'll also explore how to integrate Large Language Models (LLMs) into a data analysis workflow. These tools are transforming how we approach programming and data science, offering powerful support in code generation and problem-solving. However, their effectiveness depends entirely on our ability to engage with them mindfully. While LLMs can be valuable assistants, they are not replacements for our own understanding of the concepts and fundamentals. These models, despite their impressive capabilities, can generate (often confidently!) incorrect or misleading output. To use LLMs effectively, we need to understand when and how to incorporate their assistance and critically evaluate their suggestions.
For this worksheet, you are allowed to consult an LLM in the indicated sections. Some options to use include:
1. Fuzzy regression discontinuity design estimation#
As we discussed in the RDD II lecture, the fuzzy regression discontinuity design can be treated as if it were an instrumental variable design, where given a running variable \(R\) and a cutoff \(c\), we can define an instrument \(Z\) as:

\[
Z = \begin{cases} 1 & \text{if } R \geq c \\ 0 & \text{otherwise} \end{cases}
\]

We additionally define two covariates \(X_1\) and \(X_2\) as:

\[
X_1 = Z \cdot (R - c), \qquad X_2 = (1 - Z) \cdot (R - c)
\]
The fuzzy regression discontinuity design can then be estimated using a two-stage least squares (TSLS) regression, where we regress the outcome \(Y\) on the instrument \(Z\) and the covariates \(X_1\) and \(X_2\), using data only within the bandwidth \(b\) of the cutoff \(c\): \(R \in [c-b, c+b]\).
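To make the TSLS estimation concrete, here is a minimal two-stage least squares sketch on synthetic data, mirroring what a package like linearmodels automates for us. The data-generating process, sample size, bandwidth, and true effect of 2 are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, b = 5000, 0.0, 0.5

# Synthetic running variable, instrument, and covariates
R = rng.uniform(-1, 1, n)
Z = (R >= c).astype(float)
X1 = Z * (R - c)
X2 = (1 - Z) * (R - c)

# Fuzzy design: crossing the cutoff raises the treatment probability
# but does not fully determine treatment
T = rng.binomial(1, 0.2 + 0.6 * Z)
Y = 2.0 * T + 0.5 * R + rng.normal(0, 0.5, n)  # true treatment effect = 2

# Keep only observations within the bandwidth of the cutoff
mask = (R >= c - b) & (R <= c + b)

# Stage 1: regress T on Z, X1, X2 and predict treatment
A1 = np.column_stack([np.ones(mask.sum()), Z[mask], X1[mask], X2[mask]])
T_hat = A1 @ np.linalg.lstsq(A1, T[mask], rcond=None)[0]

# Stage 2: regress Y on the predicted treatment and the covariates;
# the coefficient on T_hat is the LATE at the cutoff
A2 = np.column_stack([np.ones(mask.sum()), T_hat, X1[mask], X2[mask]])
late = np.linalg.lstsq(A2, Y[mask], rcond=None)[0][1]
print(late)  # estimate of the LATE; close to the true effect of 2
```

This is only a sketch of the mechanics; IV2SLS additionally computes the correct TSLS standard errors, which the naive second-stage regression does not.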
1.1 FRDD data preparation [0.5 pts]#
Complete the function below to prepare a dataframe for fuzzy regression discontinuity design estimation:
def prepare_frdd_data(df, treatment_col, outcome_col, running_col, cutoff, bandwidth):
"""
Prepare a dataframe for fuzzy regression discontinuity design estimation. Specifically:
1. selects the data within the bandwidth of the cutoff
2. creates the instrument Z as an indicator for whether the running variable is >= cutoff
3. creates the covariates X1 and X2 as:
- X1 = Z * (R - c)
- X2 = (1-Z) * (R - c)
Args:
df (pd.DataFrame): The input data frame containing the variables.
treatment_col (str): The name of the treatment variable.
outcome_col (str): The name of the outcome variable.
running_col (str): The name of the running variable.
cutoff (float): The cutoff value for the running variable.
bandwidth (float): The bandwidth for the fuzzy regression discontinuity design.
Returns:
pd.DataFrame: A dataframe with the selected data, the instrument Z, and the covariates X1 and X2
"""
frdd_df = df.copy()
# TODO select the data within the bandwidth of the cutoff
# TODO create the instrument Z
# TODO create covariates X1 and X2
return frdd_df
if __name__ == "__main__":
# Creates a simple dataset for testing
test_df = pd.DataFrame({
'Y': [1, 2, 3, 4, 5],
'T': [0, 1, 0, 1, 1],
'R': [0, 1, 2, 3, 4]
})
cutoff = 2
bandwidth = 1
frdd_df = prepare_frdd_data(test_df, 'T', 'Y', 'R', cutoff, bandwidth)
assert frdd_df.shape[0] == 3, "The dataframe should have 3 rows, where the running variable is between 1 and 3"
assert 'Z' in frdd_df.columns, "The dataframe should have a column for the instrument Z"
assert 'X1' in frdd_df.columns, "The dataframe should have a column for the covariate X1"
assert 'X2' in frdd_df.columns, "The dataframe should have a column for the covariate X2"
# feel free to add more tests!
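If you want to check your approach, here is one possible implementation of the three preparation steps (named with a `_sketch` suffix so it does not shadow the function you write above; your solution may differ):

```python
import pandas as pd

def prepare_frdd_data_sketch(df, treatment_col, outcome_col, running_col, cutoff, bandwidth):
    """One possible implementation of the FRDD preparation steps."""
    frdd_df = df.copy()
    # 1. keep rows within the bandwidth of the cutoff
    R = frdd_df[running_col]
    frdd_df = frdd_df[(R >= cutoff - bandwidth) & (R <= cutoff + bandwidth)].copy()
    # 2. instrument: indicator for being at or above the cutoff
    frdd_df['Z'] = (frdd_df[running_col] >= cutoff).astype(int)
    # 3. centered running variable, split on either side of the cutoff
    centered = frdd_df[running_col] - cutoff
    frdd_df['X1'] = frdd_df['Z'] * centered
    frdd_df['X2'] = (1 - frdd_df['Z']) * centered
    return frdd_df

test_df = pd.DataFrame({'Y': [1, 2, 3, 4, 5],
                        'T': [0, 1, 0, 1, 1],
                        'R': [0, 1, 2, 3, 4]})
out = prepare_frdd_data_sketch(test_df, 'T', 'Y', 'R', cutoff=2, bandwidth=1)
print(out[['R', 'Z', 'X1', 'X2']])
```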
1.2 FRDD case study: mortgage subsidies and homeownership [1 pt]#
Let's now apply the fuzzy regression discontinuity design to a dataset studying the effects of mortgage subsidies on homeownership (Fetter 2013).
Prior knowledge/policy question: The U.S. has been experiencing a housing affordability and ownership crisis since the mid-2000s, so policymakers have been interested in the effects of potential federal intervention in the housing market. In the mid-1900s, the U.S. government provided mortgage subsidies to military veterans of the Korean War and World War II. We can use veteran status and eligibility to serve in the U.S. military to provide historical evidence on the effects of government subsidies on homeownership rates.
Causal question: How does military veteran status of the Korean War and World War II affect homeownership rates (by way of the mortgage subsidies given to veterans)?
- Running variable: birth months relative to eligibility to serve in the U.S. military during the Korean War and World War II (qob_minus_kw in the data)
- Cutoff: end of both the Korean War (1953) and World War II (1945); this has been normalized to 0 in the data, so cutoff = 0
- Treatment: whether or not an individual was a veteran (vet_wwko in the data)
- Outcome: homeownership (home_ownership in the data: 1 if they own a home, 0 otherwise)
This data lends itself to a fuzzy regression discontinuity design, where we can expect a discontinuous shift in veteran status at the end of the Korean War and World War II. However, it is not a sharp discontinuity because not all individuals eligible to serve in the U.S. military during those wars actually served.
Following the same process as Activity 13, bin the running variable (qob_minus_kw) and then create a point plot of the bins against the treatment (vet_wwko).
mortgage_df = pd.read_csv("~/COMSC-341CD/data/veteran_mortgages.csv")
# TODO bin the running variable with 25 bins, labels=False
#mortgage_df['running_bin'] = None
# TODO plot the point plot of the bins against the treatment, with linestyle='none'
#sns.pointplot(TODO)
plt.axvline(x=12, color='black', linestyle='--')
# convert xticks to the same units as the original running variable
plt.xticks(ticks=[0, 6, 12, 18, 24], labels=[-12, -6, 0, 6, 12])
plt.xlabel("Months relative to eligibility to serve in the Korean War")
Takeaway (click this once you've completed the plot)
You should observe that there isn't quite an obvious "jump" in veteran status at the cutoff, due to the many factors that could influence whether a person serves in the military (and thus becomes a veteran) outside of the running variable. However, there should be a nearly horizontal trend line for people born before the eligibility cutoff, which then drops drastically as the data get closer to the cutoff before levelling off again.
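As a reference for the binning step, pd.cut with labels=False returns integer bin indices rather than interval labels, which is what the point plot expects on its x-axis. A minimal sketch on synthetic data (the column name and value range here are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'R': rng.uniform(-12, 12, 500)})

# labels=False returns the integer bin index (0..24) instead of an Interval
df['running_bin'] = pd.cut(df['R'], bins=25, labels=False)
print(df['running_bin'].min(), df['running_bin'].max())  # 0 24
```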
Instead of manually implementing the two-stage least squares (TSLS) regression like we've done in the past, we can use linearmodels' IV2SLS to estimate the fuzzy regression discontinuity design. The formula-based syntax for IV2SLS is very similar to that of statsmodels. The new piece of syntax is that linearmodels expects a block '[T ~ Z]' in the formula string to specify the relationship between the treatment and instrument. For example, to fully specify the fuzzy regression discontinuity design regression, we would use the following formula:
frdd_formula = 'Y ~ 1 + X1 + X2 + [T ~ Z]'
Fitting the model then looks as follows:
from linearmodels.iv import IV2SLS
iv_model = IV2SLS.from_formula(frdd_formula, data)
iv_results = iv_model.fit()
Use linearmodels to estimate the LATE at the cutoff for the data, using a bandwidth of \(b=6\) and a cutoff of \(c=0\). frdd_results.summary will produce a lot of statistical output. The main number we want to look at is the estimate of the local average treatment effect (LATE) at the cutoff, which is given by the "Parameter" on the treatment variable, vet_wwko.
from linearmodels.iv import IV2SLS
# TODO call prepare_frdd_data with the data, treatment_col, outcome_col, running_col, cutoff, bandwidth
frdd_df = None
# TODO fit the IV2SLS model using the appropriate variables for the outcome, treatment, instrument, and X1, X2 covariates
frdd_formula = None
# TODO call IV2SLS.from_formula with the formula=frdd_formula and the data=frdd_df
#frdd_model = IV2SLS.from_formula(TODO)
#frdd_results = frdd_model.fit()
#print(frdd_results.summary)
Takeaway (click this once you've completed the linearmodels estimation)
The estimate using a bandwidth of 6 should show that the local average treatment effect of being a veteran at the cutoff is an increase in the probability of homeownership of 37.7 percentage points, with a p-value of about 0.01.
1.3 LLM roleplay prompting [1 pt]#
One effective way to get better results from LLMs is through roleplay prompting: giving the model a specific role and context to work within. This is similar to how we might ask a person to "think like a ______" or "explain this to me like I'm 5".
First, try asking an LLM (like Claude or ChatGPT) to help you interpret the FRDD results we just obtained, using the following basic prompt:
I've just run a fuzzy regression discontinuity design analysis on the effect of veteran status on homeownership rates. The running variable is birth months relative to eligibility to serve in the Korean War and World War II. The local average treatment effect estimate is [TODO insert estimate here] with a p-value of [TODO insert p-value here]. Can you help me interpret these results?
Next, open up a new chat with the LLM (so that there is no previous history from your last prompt) and try the same question but with a roleplay prompt sentence at the beginning:
You are an experienced data scientist specializing in causal inference methods. I've just run a fuzzy regression discontinuity design…
1. Summarize any causal design considerations discussed in each response. For example, are there mentions of assumptions like continuity and monotonicity, or discussions of compliance? Are any of the statements about these concepts erroneous or misleading?
Basic:
TODO
Roleplay prompt:
TODO
2. How did the LLM's explanations of causal design concepts compare to our own discussions of them in class? Which prompt provided the more useful and contextually appropriate explanation?
Your response: TODO
Another aspect of using LLMs to keep in mind is that they are stochastic: the same prompt may give different responses each time. Open another new chat with the LLM and ask the roleplay prompt question again. Does the LLM omit or provide more information compared to its first response?
Your response: TODO
Optional explorations
You can additionally experiment with the stability of LLM responses by asking the same question but replacing the roleplay prompt with a different role, e.g. "political scientist," "economist," "computer scientist," or "cat".
2. Difference-in-differences implementation#
2.1. Diff-in-diff data preparation [0.5 pts]#
Complete the function below to prepare a dataset for diff-in-diff analysis. In particular, we need to:
- Create a binary column post_treatment given the time period. For example, if the dataset has 3 time periods \(t=1, 2, 3\) and the treatment is applied at time step 2.5, then the rows with \(t=3\) should be post_treatment = 1 and the rest should be post_treatment = 0.
- Create a binary column treated which is 1 if the unit is in the treated group AND the observation is after the treatment. For example, from our traffic law case study in Activity 14, the treated column is created by checking if a row has town == "South Hadley" and post_treatment == 1:

| town | time | outcome | post_treatment | treated |
|---|---|---|---|---|
| South Hadley | 1 | 100 | 0 | 0 |
| South Hadley | 2 | 90 | 0 | 0 |
| South Hadley | 3 | 70 | 1 | 1 |
| Hadley | 1 | 80 | 0 | 0 |
| Hadley | 2 | 70 | 0 | 0 |
| Hadley | 3 | 60 | 1 | 0 |

- Set the index of the dataframe to be the unit and time variables.
def prepare_did_data(data, unit_col, time_col, outcome_col, treated_group, treatment_time):
"""
Prepare a dataframe for difference-in-differences analysis.
Args:
data (pd.DataFrame): The input data frame containing the variables.
unit_col (str): The name of the unit variable.
time_col (str): The name of the time variable.
outcome_col (str): The name of the outcome variable.
treated_group (str): The name of the treated group.
treatment_time (int): The time period when the treatment is applied.
Returns:
pd.DataFrame: A dataframe with the treatment indicator, post-treatment indicator, and treatment status.
"""
did_df = data.copy()
# TODO generate the post_treatment column based on the treatment_time
# TODO generate the treated column based on post_treatment and treated_group
# TODO set the index of the dataframe to be [unit_col, time_col]
return did_df
if __name__ == "__main__":
traffic_df = pd.DataFrame(
{
'town': ['South Hadley', 'South Hadley', 'South Hadley', 'Hadley', 'Hadley', 'Hadley'],
'time': [1, 2, 3, 1, 2, 3],
'outcome': [100, 90, 70, 80, 70, 60],
}
)
did_df = prepare_did_data(traffic_df, 'town', 'time', 'outcome', 'South Hadley', 2.5)
assert did_df.index.names == ['town', 'time'], "The index of the dataframe should be [unit_col, time_col]"
assert did_df['post_treatment'].tolist() == [0, 0, 1, 0, 0, 1], "The post_treatment column should be 1 for time >= 2.5"
assert did_df['treated'].tolist() == [0, 0, 1, 0, 0, 0], "The treated column should be 1 for South Hadley and post_treatment == 1"
# feel free to add more tests!
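If you want to check your approach, the preparation steps can be sketched as follows (named with a `_sketch` suffix so it does not shadow the function you write above; your solution may differ):

```python
import pandas as pd

def prepare_did_data_sketch(data, unit_col, time_col, outcome_col, treated_group, treatment_time):
    """One possible implementation of the diff-in-diff preparation steps."""
    did_df = data.copy()
    # post_treatment: 1 for observations at or after the treatment time
    did_df['post_treatment'] = (did_df[time_col] >= treatment_time).astype(int)
    # treated: 1 only for the treated group in the post-treatment period
    did_df['treated'] = ((did_df[unit_col] == treated_group) &
                         (did_df['post_treatment'] == 1)).astype(int)
    # PanelOLS expects a (unit, time) MultiIndex
    return did_df.set_index([unit_col, time_col])

traffic_df = pd.DataFrame({
    'town': ['South Hadley'] * 3 + ['Hadley'] * 3,
    'time': [1, 2, 3, 1, 2, 3],
    'outcome': [100, 90, 70, 80, 70, 60],
})
out = prepare_did_data_sketch(traffic_df, 'town', 'time', 'outcome', 'South Hadley', 2.5)
print(out)
```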
2.2 Diff-in-diff case study: organ donation [0.5 pts]#
We'll complete estimation of the organ donation case study from class using linearmodels.
The estimation process for a difference-in-differences design also utilizes linear regression, but now we have covariates that account for both the "group" a unit is in (treated vs. control) as well as time:

\[
Y_{it} = \beta_0 + \beta_1 \, \text{treated}_{it} + \beta_2 \, \text{group}_i + \beta_3 \, \text{time}_t + \epsilon_{it}
\]

Like the previous cases where we used linear regression, \(\beta_1\) is the estimate of our causal quantity, the average treatment effect on the treated (ATT).
Notice how the two new \(\beta\) coefficients correspond to the two indices we discussed in the diff-in-diff I lecture for panel data. We can think of these two beta coefficients as the group and time effects, which are kept constant or "fixed" within the groups in the data and within time steps. This regression is thus often referred to as a "two-way fixed effects" model. The main advantage a two-way fixed effects model has over a simpler interacted linear regression model is that it allows us to account for multiple time periods, which makes use of more data and thus reduces the variance of our estimate.
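For the simple two-group traffic toy data, the ATT that \(\beta_1\) estimates can be computed directly as a difference of differences in group means (a sketch for intuition; the regression version generalizes this to many units and time periods):

```python
import pandas as pd

df = pd.DataFrame({
    'town': ['South Hadley'] * 3 + ['Hadley'] * 3,
    'time': [1, 2, 3, 1, 2, 3],
    'outcome': [100, 90, 70, 80, 70, 60],
})
df['post'] = (df['time'] >= 2.5).astype(int)

# Mean outcome for each (is_treated_group, period) cell
means = df.groupby([df['town'] == 'South Hadley', 'post'])['outcome'].mean()

# Difference-in-differences:
# (treated post - treated pre) - (control post - control pre)
did = (means[(True, 1)] - means[(True, 0)]) - (means[(False, 1)] - means[(False, 0)])
print(did)  # (70 - 95) - (60 - 75) = -10.0
```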
To fit this model, we use linearmodels' PanelOLS model.
This model expects a dataframe with a multi-index, where the first level is the unit and the second level is the time period. Then, to fit the regression specified above, we would use the following formula:
did_formula = 'outcome ~ EntityEffects + TimeEffects + treated'
EntityEffects
and TimeEffects
are special keywords that tell the model to include the unit and time fixed effects, respectively. We would then fit the model as follows:
from linearmodels.panel import PanelOLS
did_model = PanelOLS.from_formula(did_formula, data)
did_results = did_model.fit(cov_type='clustered')
Statistical aside
We pass cov_type='clustered' to the fit method to account for the fact that the data is "clustered" by unit (in other words, observations are not independent within each unit), which corrects the standard errors of our estimate.
For our organ donation case study, use linearmodels to estimate the ATT for California, with the following parameters:
- outcome_col: Rate
- unit_col: State
- time_col: Quarter_Num
- treated_group: California
- treatment_time: 3.5
Like with the fuzzy regression discontinuity design, did_results.summary will produce a lot of statistical output. The main number we want to look at is the estimate of the ATT, which is given by the "Parameter" on the treatment variable, treated.
from linearmodels.panel import PanelOLS
organ_df = pd.read_csv("~/COMSC-341CD/data/organ_donations.csv")
# TODO call prepare_did_data with the appropriate parameters
# TODO fit the model using the appropriate variables for the outcome, treatment, and covariates
#did_formula = ''
# did_model = PanelOLS.from_formula(TODO)
# did_results = did_model.fit(cov_type='clustered')
# print(did_results.summary)
Takeaway (click this once you've completed the linearmodels estimation)
The ATT estimate using a treatment time of 3.5 for using active choice over opt-in questions in California should be a reduction in the rate of organ donations of 2.25 percentage points, with a p-value of <0.01.
2.3 LLM chain of thought prompting [0.5 pts]#
Another effective prompting technique is chain of thought prompting, where we break down a complex task into smaller, more manageable pieces. This helps the LLM reason through the problem more systematically, similar to how a person might work through a problem step by step.
1. Complete the following prompt, filling out the TODOs by providing descriptions of the steps:
Can you help me conduct a difference-in-differences analysis on organ donation data in Python? Let's break this down step by step.
- First: TODO description of what the prepare_did_data method should do. You can provide a signature and a description of the method.
- Next: apply the prepare_did_data method to the organ donation data, called organ_df. TODO description of the data. You should tell the LLM what the names of all of the relevant columns are.
- Finally: TODO description of what the linearmodels PanelOLS model should do for diff-in-diff analysis.
2. Paste the code the LLM generates based on your prompt in the cell below and run it. Does it replicate the results from above? Does it generate any unnecessary or incorrect code?
Your response: TODO
# TODO paste the code generated by the LLM here, and run it
3. Reflection#
3.1 LLM usage [0.5 pts]#
1. In this worksheet, we used LLMs to help with both code generation and interpretation of results. We took specific care to structure prompts appropriately and to provide the LLM with the correct context. What does this tell you about the importance of understanding the underlying concepts even when using LLMs?
Your response: TODO
2. The LLM responses you received were likely different from what your classmates received, even with the same prompts. How does this variability affect how you might use LLMs in your future work? What strategies might you use to handle this uncertainty?
Your response: TODO
3.2 Worksheet reflections [0.5 pts]#
1. How much time did it take you to complete this worksheet?
Your Response: TODO
2. What is one thing you have a better understanding of after completing this worksheet and going through the class content this week? This could be about the concepts, the reading, or the code.
Your Response: TODO
3. What questions or points of confusion do you have about the material covered in the past week of class?
Your Response: TODO
Tip
Don't forget to check that all of the TODOs are filled in before you submit!
Acknowledgements#
This worksheet builds off principles from Anthropic's Prompt Engineering Tutorial and uses data from Nick Huntington-Klein's causaldata package.