Activity 12: Vectorization#
2025-10-22
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
Logical indexing and vectorization#
Let’s briefly explore the benefits of “vectorization” in NumPy and Pandas, which is the practice of performing operations on entire arrays at once, rather than element-by-element via a loop.
We’ll compute the numerator of the standardized difference calculation \(d_X\) for Project 2, which is essentially the difference-in-means estimator for covariate \(X\):

\[
\hat{E}[X \mid T=1] - \hat{E}[X \mid T=0]
\]

Recall that \(\hat{E}[X \mid T=t]\) is the estimated mean of \(X\) given that \(T=t\).
Each code cell below has a %%time magic command that times the execution of the cell. You can use this to compare the performance of vectorized and looped code.
%%time
# short demo of the time magic command
for i in range(1000):
pass
CPU times: user 17 μs, sys: 0 ns, total: 17 μs
Wall time: 19.1 μs
# load in a simulated dataset with 50000 rows and T, X columns
df = pd.read_csv("~/COMSC-341CD/data/activity12_sim_data.csv")
df.head()
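If the CSV isn’t available at that path on your machine, a comparable DataFrame can be simulated. The column names `T` and `X`, the 50000-row size, and the 10-unit mean difference come from this activity’s code and assertions; the noise scale and group balance are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n = 50_000
# binary treatment indicator; roughly balanced groups (assumption)
T = rng.integers(0, 2, size=n)
# X averages 0 in control and 10 under treatment, matching the asserts
# later in the activity; the unit noise scale is an assumption
X = rng.normal(loc=0.0, scale=1.0, size=n) + 10 * T

df = pd.DataFrame({"T": T, "X": X})
```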
1.1: Code with loop#
# %%time
# TODO uncomment the above line once you have the code working to see how long it takes to run.
# TODO implement the code using the loop and pandas logical indexing.
T1_sum = 0
T0_sum = 0
# sum up the values of X for each treatment group by iterating over the rows of the DataFrame
# Python note: "_" is a convention for a variable that is not used in the loop
for _, row in df.iterrows():
# TODO your code here
pass
# calculate the mean of X for each treatment group
T1_mean = T1_sum / df[df["T"] == 1].shape[0]
T0_mean = T0_sum / df[df["T"] == 0].shape[0]
# the difference should be close to 10 for the simulated data
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)
CPU times: user 484 ms, sys: 3.44 ms, total: 487 ms
Wall time: 486 ms
What is the wall clock time (in milliseconds) it takes for your code to run?
Your response: pollev.com/tliu
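One possible way to fill in the loop above (a sketch, shown on a small stand-in DataFrame so it runs on its own; `row["T"]` and `row["X"]` access each row’s column values):

```python
import numpy as np
import pandas as pd

# small stand-in DataFrame (assumed data with a 10-unit group difference)
df = pd.DataFrame({"T": [1, 0, 1, 0], "X": [12.0, 2.0, 11.0, 1.0]})

T1_sum = 0.0
T0_sum = 0.0
# accumulate X separately for each treatment group, one row at a time
for _, row in df.iterrows():
    if row["T"] == 1:
        T1_sum += row["X"]
    else:
        T0_sum += row["X"]

# divide by each group's size to get the group means
T1_mean = T1_sum / (df["T"] == 1).sum()
T0_mean = T0_sum / (df["T"] == 0).sum()
```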
1.2: Vectorized code#
# %%time
# TODO uncomment the above line once you have the code working to see how long it takes to run.
# TODO implement the code with no loops, using logical indexing and the mean() function. It should be much shorter than the code with the loop.
T1_mean = 0
T0_mean = 0
# the difference should be close to 10
assert np.isclose(T1_mean - T0_mean, 10, atol=1e-2)
CPU times: user 1.42 ms, sys: 705 μs, total: 2.12 ms
Wall time: 1.56 ms
What is the wall clock time (in milliseconds) it takes for your code to run?
Your response: pollev.com/tliu
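A vectorized version replaces the loop entirely with logical indexing and `.mean()` (again a sketch on a small stand-in DataFrame):

```python
import pandas as pd

# small stand-in DataFrame (assumed data with a 10-unit group difference)
df = pd.DataFrame({"T": [1, 0, 1, 0], "X": [12.0, 2.0, 11.0, 1.0]})

# select each group's X values with a boolean mask, then take the mean
T1_mean = df.loc[df["T"] == 1, "X"].mean()
T0_mean = df.loc[df["T"] == 0, "X"].mean()
```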
Takeaway
If you’re writing code that processes NumPy arrays or pandas DataFrames and you find yourself looping over the length of an array or DataFrame, consider whether the operation can be vectorized. Unless the task explicitly calls for iterating over rows (as in implementing greedy pair matching for Project 2), you can likely achieve the same result without the loop, yielding code that is both faster and more concise.
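The same comparison can be timed outside a notebook with `time.perf_counter` (the array size is arbitrary and exact timings vary by machine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)

# looped sum: one Python-level iteration per element
start = time.perf_counter()
total = 0.0
for v in x:
    total += v
loop_s = time.perf_counter() - start

# vectorized sum: a single call into NumPy's compiled code
start = time.perf_counter()
vec_total = x.sum()
vec_s = time.perf_counter() - start

print(f"loop: {loop_s:.4f}s  vectorized: {vec_s:.4f}s")
```
On typical hardware the vectorized sum runs several orders of magnitude faster, while both produce the same result up to floating-point rounding.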