Activity 12: Vectorization#

2025-10-22


import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

Logical indexing and vectorization#

Let’s briefly explore the benefits of “vectorization” in NumPy and Pandas, which is the practice of performing operations on entire arrays at once, rather than element-by-element via a loop.

We’ll compute the numerator of the standardized difference calculation \(d_X\) for Project 2, which is essentially the difference-in-means estimator for covariate \(X\):

\[ d_X = \frac{\hat{E}[X \mid T=1] - \hat{E}[X \mid T=0]}{\sqrt{\dfrac{\hat{V}[X \mid T=1] + \hat{V}[X \mid T=0]}{2}}} \]

Recall that \(\hat{E}[X \mid T=t]\) is the estimated mean of \(X\) given that \(T=t\).
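As a small illustration of logical indexing (using a tiny made-up DataFrame, not the activity dataset), we can estimate these conditional means by selecting the rows in each treatment group and averaging:

```python
import pandas as pd

# tiny made-up example, not the activity dataset
toy = pd.DataFrame({
    "T": [1, 0, 1, 0, 1, 0],
    "X": [12.0, 2.0, 11.0, 3.0, 13.0, 1.0],
})

# E-hat[X | T=1]: select rows where T == 1, then take the mean of X
E_X_T1 = toy[toy["T"] == 1]["X"].mean()
# E-hat[X | T=0]: same idea for the control group
E_X_T0 = toy[toy["T"] == 0]["X"].mean()

print(E_X_T1, E_X_T0)  # 12.0 2.0
```

The boolean expression `toy["T"] == 1` produces a mask of True/False values, and indexing the DataFrame with it keeps only the rows where the mask is True.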

Several code cells below use the %%time magic command, which times the execution of the cell. You can use it to compare the performance of vectorized and looped code.

%%time
# short demo of the time magic command

for i in range(1000):
    pass
CPU times: user 17 μs, sys: 0 ns, total: 17 μs
Wall time: 19.1 μs
# load in a simulated dataset with 50000 rows and T, X columns
df = pd.read_csv("~/COMSC-341CD/data/activity12_sim_data.csv")

df.head()

1.1: Code with a loop#

# %%time 
# TODO uncomment the above line once you have the code working to see how long it takes to run.

# TODO implement the code using the loop and pandas logical indexing.
T1_sum = 0
T0_sum = 0

# sum up the values of X for each treatment group by iterating over the rows of the DataFrame
# Python note: "_" is a convention for a variable that is not used in the loop
for _, row in df.iterrows():
    # TODO your code here
    pass

# calculate the mean of X for each treatment group
T1_mean = T1_sum / df[df["T"] == 1].shape[0]
T0_mean = T0_sum / df[df["T"] == 0].shape[0]

# the difference should be close to 10 for the simulated data
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)
CPU times: user 484 ms, sys: 3.44 ms, total: 487 ms
Wall time: 486 ms

What is the wall-clock time (in milliseconds) your code takes to run?

Your response: pollev.com/tliu

1.2: Vectorized code#

# %%time 
# TODO uncomment the above line once you have the code working to see how long it takes to run.

# TODO implement the code with no loops, using logical indexing and the mean() function. It should be much shorter than the code with the loop.
T1_mean = 0
T0_mean = 0


# the difference should be close to 10
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)
CPU times: user 1.42 ms, sys: 705 μs, total: 2.12 ms
Wall time: 1.56 ms

What is the wall-clock time (in milliseconds) your code takes to run?

Your response: pollev.com/tliu

Takeaway#

If you’re writing code that processes NumPy arrays or pandas DataFrames and you find yourself looping over the length of an array or DataFrame, consider whether you can vectorize the operation. Unless the task explicitly requires iterating over rows (as in the greedy pair matching implementation for Project 2), you can likely achieve the same result without the loop, yielding code that is both faster and more concise.
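As a rough sketch of the kind of speedup involved (array size and timings here are made up for illustration and will vary by machine), here is a loop-based sum compared against NumPy's vectorized sum:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)

# looped version: accumulate one element at a time
start = time.perf_counter()
total_loop = 0.0
for v in x:
    total_loop += v
loop_time = time.perf_counter() - start

# vectorized version: one call that operates on the whole array
start = time.perf_counter()
total_vec = x.sum()
vec_time = time.perf_counter() - start

# both approaches compute the same quantity (up to float rounding)
assert np.isclose(total_loop, total_vec)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```

The vectorized call dispatches the work to compiled C code inside NumPy, which is why it typically runs orders of magnitude faster than the Python-level loop.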