Activity 10: Vectorization

2025-03-25

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

Logical indexing and vectorization

Let’s briefly explore the benefits of “vectorization” in NumPy and pandas: the practice of performing an operation on an entire array at once, rather than element by element in a loop.
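To make the contrast concrete, here is a minimal sketch that computes a sum of squares both ways. The array, the seed, and the variable names are made up for this illustration and are not part of the activity data.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration
x = rng.normal(size=1_000)

# Element-by-element: square each value and accumulate in a Python loop
loop_total = 0.0
for value in x:
    loop_total += value ** 2

# Vectorized: one expression squares the whole array, then sums it
vectorized_total = np.sum(x ** 2)

# Both approaches agree up to floating-point rounding
print(np.isclose(loop_total, vectorized_total))
```

The vectorized line pushes the loop down into NumPy's compiled code, which is usually where the speedup comes from.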

We’ll compute the numerator of the standardized difference calculation \(d_X\) for Project 2, which is essentially the difference-in-means estimator for covariate \(X\):

\[ d_X = \frac{\hat{E}[X \mid T=1] - \hat{E}[X \mid T=0]} { \large \sqrt{\frac{\hat{V}[X \mid T=1] + \hat{V}[X \mid T=0]} {2}}} \]

Recall that \(\hat{E}[X \mid T=t]\) is the estimated mean of \(X\) given that \(T=t\).
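As a rough sketch of how the full formula maps onto code, the example below evaluates \(d_X\) on a tiny hand-made DataFrame. The helper name `standardized_difference` and the toy data are hypothetical, and using the sample variance (pandas' default `ddof=1`) for \(\hat{V}\) is an assumption; check the Project 2 specification for the convention it expects.

```python
import numpy as np
import pandas as pd


def standardized_difference(data, covariate="X", treatment="T"):
    """Hypothetical helper: standardized difference of one covariate."""
    # Logical indexing: select the covariate values in each treatment group
    treated = data.loc[data[treatment] == 1, covariate]
    control = data.loc[data[treatment] == 0, covariate]

    # Numerator: difference in estimated group means
    numerator = treated.mean() - control.mean()

    # Denominator: square root of the average of the two group variances
    # (Series.var() uses ddof=1 by default; this choice is an assumption)
    pooled_sd = np.sqrt((treated.var() + control.var()) / 2)
    return numerator / pooled_sd


# Toy data: group means are 5 and 2, and each group has sample variance 1
toy = pd.DataFrame({"T": [1, 1, 1, 0, 0, 0],
                    "X": [4.0, 5.0, 6.0, 1.0, 2.0, 3.0]})
d_x = standardized_difference(toy)
print(d_x)
```

With these toy values the numerator is 3 and the denominator is 1, so the result is 3.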

Each code cell below has a %%time magic command that times the execution of the cell (in the exercise cells it stays commented out until your code works). You can use the reported wall time to compare the performance of vectorized and looped code.

%%time
# short demo of the time magic command

for i in range(1000):
    pass

CPU times: user 29 μs, sys: 1e+03 ns, total: 30 μs
Wall time: 32.2 μs
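The %%time magic only works inside IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard library's `time.perf_counter`; the variable names here are illustrative only.

```python
import time

# Record a high-resolution timestamp before and after the work being timed
start = time.perf_counter()
for i in range(1000):
    pass
elapsed = time.perf_counter() - start

# Report wall-clock time in microseconds, similar to the %%time output
print(f"Wall time: {elapsed * 1e6:.1f} µs")
```

A single run like this is noisy, so treat the number as an order-of-magnitude estimate rather than a precise benchmark.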

# load in a simulated dataset with 50000 rows and T, X columns
df = pd.read_csv("~/COMSC-341CD/data/activity10_sim_data.csv")

df.head()

	T	X
0	1	10.468910
1	0	0.250017
2	1	10.110894
3	1	9.958819
4	1	10.287618

1.1: code with loop

# %%time 
# TODO uncomment the above line once you have the code working to see how long it takes to run.

# TODO implement the code using the loop and .loc indexing.
T1_sum = 0
T0_sum = 0

# sum up the values of X for each treatment group by iterating over the rows of the DataFrame
# Python note: "_" is a convention for a variable that is not used in the loop
for _, row in df.iterrows():
    # TODO your code here
    pass

# calculate the mean of X for each treatment group
T1_mean = T1_sum / df[df["T"] == 1].shape[0]
T0_mean = T0_sum / df[df["T"] == 0].shape[0]

# the difference should be close to 10 for the simulated data
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[4], line 19
     16 T0_mean = T0_sum / df[df["T"] == 0].shape[0]
     18 # the difference should be close to 10 for the simulated data
---> 19 assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)

AssertionError: 

What is the wall clock time it takes for your code to run?

Your response: pollev.com/tliu

1.2: vectorized code

# %%time 
# TODO uncomment the above line once you have the code working to see how long it takes to run.

# TODO implement the code with no loops, using logical indexing and the mean() function. It should be much shorter than the code with the loop.
T1_mean = 0
T0_mean = 0


# the difference should be close to 10
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)

What is the wall clock time it takes for your code to run?

Your response: pollev.com/tliu

Takeaway

If you’re processing NumPy arrays or pandas DataFrames and find yourself writing a loop over the length of an array or DataFrame, stop and ask whether the operation can be vectorized. Unless the operation explicitly calls for iterating over the rows of a DataFrame (as we do when implementing greedy pair matching for Project 2), you can likely achieve the same result without the loop, and the code will be both faster and more concise.
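As one more illustration of replacing a row loop, the sketch below labels each row by treatment group, first with `iterrows()` and then with `np.where`. The toy DataFrame and the label strings are invented for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"T": [1, 0, 1, 0], "X": [10.5, 0.2, 9.8, -0.1]})

# Loop version: build the labels one row at a time
labels_loop = []
for _, row in toy.iterrows():
    labels_loop.append("treated" if row["T"] == 1 else "control")

# Vectorized version: evaluate the condition on the whole column at once
labels_vec = np.where(toy["T"] == 1, "treated", "control")

# Both versions produce the same labels
print(list(labels_vec) == labels_loop)
```

The pattern generalizes: a per-row `if`/`else` inside a loop can often be rewritten as a boolean condition on a column passed to `np.where`.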
