(activity12_solution)=
# Activity 12: Vectorization Solutions

**2025-10-22**

---

In [1]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Logical indexing and vectorization

Let's briefly explore the benefits of "vectorization" in NumPy and pandas: performing an operation on an entire array at once, rather than element by element in a loop.

We'll compute the **numerator** of the [standardized difference calculation](https://comsc341cd.github.io/projects/proj2_functions.html#love-plots-for-visualizing-covariate-balance) $d_X$ for Project 2, which is essentially the difference-in-means estimator for covariate $X$:

$$
d_X = \frac{\hat{E}[X \mid T=1] - \hat{E}[X \mid T=0]} { \large \sqrt{\frac{\hat{V}[X \mid T=1] + \hat{V}[X \mid T=0]} {2}}}
$$

Recall that $\hat{E}[X \mid T=t]$ is the estimated mean of $X$ given that $T=t$.
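
Written out in sample terms, the estimated mean is just the average of $X$ over the rows in group $t$. The symbol $n_t$ (the number of rows with $T=t$) is introduced here for illustration and is not part of the original formula:

$$
\hat{E}[X \mid T=t] = \frac{1}{n_t} \sum_{i \,:\, T_i = t} X_i, \qquad n_t = \sum_{i} \mathbb{1}[T_i = t]
$$

The numerator we compute below is then $\hat{E}[X \mid T=1] - \hat{E}[X \mid T=0]$.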

Each code cell below has a `%%time` [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that times the execution of the cell. You can use this to compare the performance of vectorized and looped code.

In [3]:
%%time
# short demo of the time magic command

for i in range(1000):
    pass

CPU times: user 45 μs, sys: 8 μs, total: 53 μs
Wall time: 57.9 μs
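
The `%%time` magic is only available in IPython and Jupyter. As a rough sketch of how you might time the same loop in a plain Python script, one option is the standard library's `time.perf_counter`. The variable names `start` and `elapsed` are illustrative, and the reported time will differ from machine to machine:

```python
import time

start = time.perf_counter()  # high-resolution clock intended for timing code
for i in range(1000):
    pass
elapsed = time.perf_counter() - start  # elapsed wall-clock time in seconds

print(f"Wall time: {elapsed * 1e6:.1f} μs")
```

Unlike `%%time`, this reports only wall-clock time, not the user and system CPU times.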


In [4]:
# load in a simulated dataset with 50000 rows and T, X columns
df = pd.read_csv("~/COMSC-341CD/data/activity12_sim_data.csv")

df.head()

| Unnamed: 0 | T | X |
|---|---|---|
| 0 | 1 | 10.46891 |
| 1 | 0 | 0.250017 |
| 2 | 1 | 10.110894 |
| 3 | 1 | 9.958819 |
| 4 | 1 | 10.287618 |


## 1.1: code with loop

In [8]:
%%time 
# TODO uncomment the above line once you have the code working to see how long it takes to run.

# TODO implement the code using the loop and pandas logical indexing.
T1_sum = 0
T0_sum = 0

# sum up the values of X for each treatment group by iterating over the rows of the DataFrame
# Python note: "_" is a convention for a variable that is not used in the loop
for _, row in df.iterrows():
    if row['T'] == 1:
        T1_sum += row['X']
    else:
        T0_sum += row['X']

# calculate the mean of X for each treatment group
T1_mean = T1_sum / df[df["T"] == 1].shape[0]
T0_mean = T0_sum / df[df["T"] == 0].shape[0]

# the difference should be close to 10 for the simulated data
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)

CPU times: user 1.42 s, sys: 980 μs, total: 1.42 s
Wall time: 1.42 s


What is the wall clock time (in milliseconds) it takes for your code to run?

**Your response**: [pollev.com/tliu](https://pollev.com/tliu)

## 1.2: vectorized code

In [10]:
%%time 
# TODO uncomment the above line once you have the code working to see how long it takes to run.

# TODO implement the code with no loops, using logical indexing and the mean() function. It should be much shorter than the code with the loop.
T1_mean = df[df['T'] == 1]['X'].mean()
T0_mean = df[df['T'] == 0]['X'].mean()

# the difference should be close to 10
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)

CPU times: user 4.19 ms, sys: 0 ns, total: 4.19 ms
Wall time: 3.2 ms


What is the wall clock time (in milliseconds) it takes for your code to run?

**Your response**: [pollev.com/tliu](https://pollev.com/tliu)
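
Another vectorized route to the same quantity is `groupby`, which computes the mean of `X` within each value of `T` in a single call. The sketch below is not the activity's solution: it assumes a hypothetical stand-in dataset (`sim_df`, generated here so the block runs on its own) in which treated rows have `X` centered at 10 and control rows at 0, mimicking the simulated CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the activity's CSV: T is a 0/1 treatment flag,
# and X is centered at 10 for treated rows and at 0 for control rows.
rng = np.random.default_rng(0)
n = 50_000
T = rng.integers(0, 2, size=n)
X = rng.normal(loc=10 * T, scale=1.0, size=n)
sim_df = pd.DataFrame({"T": T, "X": X})

# Mean of X within each treatment group, indexed by the value of T
group_means = sim_df.groupby("T")["X"].mean()
diff_in_means = group_means.loc[1] - group_means.loc[0]

# Same quantity via logical indexing, as in the vectorized cell above
mask_diff = (
    sim_df[sim_df["T"] == 1]["X"].mean() - sim_df[sim_df["T"] == 0]["X"].mean()
)
assert np.isclose(diff_in_means, mask_diff)
```

Logical indexing is arguably clearer when there are only two groups; `groupby` tends to pay off when the grouping column has many distinct values.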

:::{admonition} Takeaway

If you're processing NumPy arrays or pandas DataFrames and find yourself writing a loop over the length of an array or the rows of a DataFrame, ask whether the operation can be vectorized. Unless the operation explicitly requires iterating over rows (as we do when implementing greedy pair matching for Project 2), you can likely get the same result without the loop, and the code will be both faster and more concise.
:::
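
The same idea carries over to plain NumPy arrays. As a minimal sketch, assuming a small hypothetical example (the arrays `t` and `x` below are made up for illustration and are not the activity's data), a boolean mask selects each group without any loop:

```python
import numpy as np

# Hypothetical toy data, not the activity's dataset
t = np.array([1, 0, 1, 1, 0])                # treatment indicator
x = np.array([10.0, 0.5, 11.0, 9.0, -0.5])   # covariate values

treated = t == 1               # boolean mask: True where the unit is treated
t1_mean = x[treated].mean()    # mean of X over treated units
t0_mean = x[~treated].mean()   # mean of X over control units (mask negated)
diff = t1_mean - t0_mean

print(diff)
```

Here `~treated` flips the mask, so the control group is selected without a second comparison.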