Activity 10: Vectorization
2025-03-25
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
Logical indexing and vectorization
Let’s briefly explore the benefits of “vectorization” in NumPy and pandas: the practice of performing operations on entire arrays at once, rather than element by element in a loop.
We’ll compute the numerator of the standardized difference \(d_X\) from Project 2, which is essentially the difference-in-means estimator for covariate \(X\):
\[\hat{E}[X \mid T=1] - \hat{E}[X \mid T=0]\]
Recall that \(\hat{E}[X \mid T=t]\) is the estimated mean of \(X\) given that \(T=t\).
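As a small worked example of the estimator described above, here is a minimal sketch on made-up toy arrays (the values of `T` and `X` below are hypothetical and are not taken from the activity's dataset). It uses logical indexing to pick out each treatment group and then takes the mean of each group.

```python
import numpy as np

# Hypothetical toy data (NOT the activity's dataset):
# T is the treatment indicator, X is the covariate.
T = np.array([1, 0, 1, 1, 0, 0])
X = np.array([10.0, 0.5, 11.0, 9.0, -0.5, 1.5])

# Boolean masks select each treatment group without a loop.
treated = T == 1
control = T == 0

# Estimated conditional means of X given T=1 and T=0.
mean_treated = X[treated].mean()  # (10 + 11 + 9) / 3 = 10.0
mean_control = X[control].mean()  # (0.5 - 0.5 + 1.5) / 3 = 0.5

diff_in_means = mean_treated - mean_control
print(diff_in_means)  # 9.5
```

The masks `treated` and `control` are arrays of booleans with the same length as `X`, so `X[treated]` keeps only the entries where `T == 1`.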
Each code cell below has a %%time magic command that times the execution of the cell. You can use this to compare the performance of vectorized and looped code.
%%time
# short demo of the time magic command
for i in range(1000):
    pass
CPU times: user 17 μs, sys: 0 ns, total: 17 μs
Wall time: 17.9 μs
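The %%time magic only works inside IPython or Jupyter, and it must be the first line of a cell. If you want a similar measurement in a plain Python script, one possible approach is sketched below using `time.perf_counter` from the standard library. The helper name `time_call` and the function `empty_loop` are hypothetical and exist only for this illustration. Unlike %%time, this sketch reports wall-clock time only, not CPU time.

```python
import time


def time_call(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed_seconds) for one call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def empty_loop(n):
    """Same loop as the %%time demo: iterate n times and do nothing."""
    for _ in range(n):
        pass
    return n


result, elapsed = time_call(empty_loop, 1000)
print(f"Wall time: {elapsed * 1e6:.1f} μs")
```

A single measurement like this is noisy, so expect the number to vary from run to run.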
# load in a simulated dataset with 50000 rows and T, X columns
df = pd.read_csv("~/COMSC-341CD/data/activity10_sim_data.csv")
df.head()
 | T | X |
---|---|---|
0 | 1 | 10.468910 |
1 | 0 | 0.250017 |
2 | 1 | 10.110894 |
3 | 1 | 9.958819 |
4 | 1 | 10.287618 |
1.1: Code with loop
# %%time
# TODO uncomment the above line once you have the code working to see how long it takes to run.
# TODO implement the code using the loop and .loc indexing.
T1_sum = 0
T0_sum = 0
# sum up the values of X for each treatment group by iterating over the rows of the DataFrame
# Python note: "_" is a convention for a variable that is not used in the loop
for _, row in df.iterrows():
    # TODO your code here
    pass
# calculate the mean of X for each treatment group
T1_mean = T1_sum / df[df["T"] == 1].shape[0]
T0_mean = T0_sum / df[df["T"] == 0].shape[0]
# the difference should be close to 10 for the simulated data
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)
What is the wall clock time it takes for your code to run?
Your response: pollev.com/tliu
1.2: Vectorized code
# %%time
# TODO uncomment the above line once you have the code working to see how long it takes to run.
# TODO implement the code with no loops, using logical indexing and the mean() function. It should be much shorter than the code with the loop.
T1_mean = 0
T0_mean = 0
# the difference should be close to 10
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)
What is the wall clock time it takes for your code to run?
Your response: pollev.com/tliu
Takeaway
If you’re writing code that processes NumPy arrays or pandas DataFrames and you find yourself writing a loop over the length of an array or DataFrame, consider whether you can vectorize the operation. Unless the operation explicitly calls for iterating over the rows of a DataFrame (as we do when implementing greedy pair matching for Project 2), you can likely achieve the same result without the loop, and the code will be both faster and more concise.
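To make the comparison concrete, here is one possible sketch that computes the same difference in means with a row loop and with logical indexing, and times both. It runs on synthetic data generated in the block itself (an assumption for illustration, not the activity's CSV), and the names `demo`, `diff_in_means_loop`, and `diff_in_means_vectorized` are hypothetical.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical synthetic data: X is centered near 10 when T == 1 and near 0 when T == 0.
rng = np.random.default_rng(0)
n = 5_000
T = rng.integers(0, 2, size=n)
X = rng.normal(loc=10 * T, scale=1.0)
demo = pd.DataFrame({"T": T, "X": X})


def diff_in_means_loop(frame):
    """Accumulate per-group sums and counts one row at a time."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _, row in frame.iterrows():
        t = int(row["T"])
        sums[t] += row["X"]
        counts[t] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


def diff_in_means_vectorized(frame):
    """Select each group with a boolean mask and let pandas compute the means."""
    is_treated = frame["T"] == 1
    return frame.loc[is_treated, "X"].mean() - frame.loc[~is_treated, "X"].mean()


start = time.perf_counter()
loop_result = diff_in_means_loop(demo)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = diff_in_means_vectorized(demo)
vec_seconds = time.perf_counter() - start

print(f"loop:       {loop_result:.4f} in {loop_seconds:.4f} s")
print(f"vectorized: {vec_result:.4f} in {vec_seconds:.4f} s")
```

Both functions should agree up to floating-point rounding. The exact timings depend on your machine, but the vectorized version is typically much faster because the per-row work happens in compiled code rather than in the Python interpreter.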