Activity 20: Data wrangling with ecological data#

2025-11-24


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Part 1: Understanding the raw data#

We have a messy table, landplots_raw, which contains land-plot-level data:

  • plot_id: unique identifier for each plot

  • ecosystem_type: type of ecosystem (e.g., grassland, shrubland, forest)

  • trt_fertilizer_group: treatment group

  • biomass_2023: above-ground biomass in 2023 (g/m^2)

  • biomass_2024: above-ground biomass in 2024 (g/m^2)

  • soil_nitrogen_pct: soil nitrogen content prior to fertilization

Let's first inspect the structure of the dataframe:

landplots_raw = pd.read_csv("~/COMSC-341CD/data/plots_raw.csv")

# TODO inspect the rows, columns, and data types
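Once the CSV loads, a quick structural inspection can be sketched as follows (using a small stand-in frame here, since the real plots_raw.csv isn't bundled with this page):

```python
import pandas as pd

# Stand-in for landplots_raw; in the activity this comes from pd.read_csv above
landplots_raw = pd.DataFrame({
    "plot_id": [1, 2, 3],
    "ecosystem_type": ["grassland", "forest", "shrubland"],
    "biomass_2023": [410.2, 388.5, 295.0],
})

print(landplots_raw.shape)   # (number of rows, number of columns)
print(landplots_raw.dtypes)  # data type of each column
print(landplots_raw.head())  # first few rows
landplots_raw.info()         # rows, columns, dtypes, and non-null counts in one call
```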

Units and IDs#

For causal analysis, we need to be clear about what counts as a unit and how it is identified. We’ll use plot_id to check for uniqueness and duplicates. Let’s answer the following questions:

  1. How many duplicated rows are there in landplots_raw?

  2. Which plot_ids, if any, are duplicated?

# TODO print the number of duplicated rows

# TODO print the duplicated plot_ids
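The duplicate checks can be sketched with `duplicated()`, again on a toy frame in place of the real data:

```python
import pandas as pd

# Toy frame in which plot_id 2 appears twice as a fully duplicated row
landplots_raw = pd.DataFrame({
    "plot_id": [1, 2, 2, 3],
    "biomass_2023": [410.2, 388.5, 388.5, 295.0],
})

# Number of fully duplicated rows (the first occurrence is not counted)
n_dupes = landplots_raw.duplicated().sum()
print(n_dupes)  # 1

# plot_ids that appear more than once (keep=False marks every occurrence)
dup_ids = landplots_raw.loc[
    landplots_raw["plot_id"].duplicated(keep=False), "plot_id"
].unique()
print(dup_ids)  # [2]
```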

Then, we can drop the duplicates. Instead of overwriting landplots_raw, we’ll create a new variable landplots_clean to preserve the original data.

landplots_clean = landplots_raw.copy()

# TODO drop duplicates from landplots_clean
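Dropping the duplicates might then look like this (toy data again; `reset_index` is optional but keeps the row index contiguous):

```python
import pandas as pd

landplots_raw = pd.DataFrame({
    "plot_id": [1, 2, 2, 3],
    "biomass_2023": [410.2, 388.5, 388.5, 295.0],
})

landplots_clean = landplots_raw.copy()
# Keep the first occurrence of each duplicated row
landplots_clean = landplots_clean.drop_duplicates().reset_index(drop=True)
print(len(landplots_clean))  # 3
```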

Part 2: Cleaning variables and columns#

Next, we will clean key variables in landplots_clean:

  1. standardize ecosystem_type labels

  2. rename trt_fertilizer_group to treatment_group

  3. standardize treatment_group labels in a new column treatment_group_clean

  4. create a binary indicator in a new column treated

# TODO standardize ecosystem_type labels

# TODO rename trt_fertilizer_group to treatment_group

# TODO standardize treatment_group labels in a new column treatment_group_clean

# TODO treated binary treatment indicator: 1 if nitrogen fertilized, 0 if control
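One way to sketch these four steps follows. Note that the messy label variants below (e.g. `"N-fert"`, `"n_fertilizer"`) are hypothetical; the real values in plots_raw.csv may differ, so the mapping would need to be adapted:

```python
import pandas as pd

# Hypothetical messy labels -- the real CSV's values may differ
landplots_clean = pd.DataFrame({
    "ecosystem_type": ["Grassland ", "FOREST", "shrubland"],
    "trt_fertilizer_group": ["N-fert", "Control", "n_fertilizer"],
})

# 1. standardize ecosystem_type: strip whitespace, lowercase
landplots_clean["ecosystem_type"] = (
    landplots_clean["ecosystem_type"].str.strip().str.lower()
)

# 2. rename trt_fertilizer_group to treatment_group
landplots_clean = landplots_clean.rename(
    columns={"trt_fertilizer_group": "treatment_group"}
)

# 3. map variant spellings onto canonical labels (mapping is illustrative)
label_map = {"N-fert": "nitrogen", "n_fertilizer": "nitrogen", "Control": "control"}
landplots_clean["treatment_group_clean"] = landplots_clean["treatment_group"].map(label_map)

# 4. binary treatment indicator: 1 if nitrogen fertilized, 0 if control
landplots_clean["treated"] = (
    landplots_clean["treatment_group_clean"] == "nitrogen"
).astype(int)
```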

Part 3: Missingness and bad values#

The biomass variables contain special codes and implausible values.

We’ll first recode -99 as NaN in biomass_2023 and biomass_2024:

# TODO replace -99 with NaN in `biomass_2023` and `biomass_2024`
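The recode can be sketched with `replace()` on both columns at once (toy values below):

```python
import numpy as np
import pandas as pd

landplots_clean = pd.DataFrame({
    "biomass_2023": [410.2, -99, 295.0],
    "biomass_2024": [-99, 402.1, 310.4],
})

# -99 is a sentinel for "missing", so recode it as NaN in both columns
cols = ["biomass_2023", "biomass_2024"]
landplots_clean[cols] = landplots_clean[cols].replace(-99, np.nan)
```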

Additionally, there are some implausible (negative) values in biomass_2023. We’ll flag these with an indicator biomass_2023_implausible, and also set those values in biomass_2023 to NaN:

# TODO flag implausible biomass values with an indicator

# TODO set implausible biomass values in `biomass_2023` column to NaN
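A sketch of the flag-then-null pattern, assuming "implausible" means negative biomass (order matters: compute the flag before overwriting the values):

```python
import numpy as np
import pandas as pd

landplots_clean = pd.DataFrame({"biomass_2023": [410.2, -12.5, 295.0]})

# Biomass cannot be negative, so flag those rows first...
landplots_clean["biomass_2023_implausible"] = (
    landplots_clean["biomass_2023"] < 0
).astype(int)

# ...then null out the implausible measurements
landplots_clean.loc[landplots_clean["biomass_2023"] < 0, "biomass_2023"] = np.nan
```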

Next, we’ll check the missingness rates of the variables. What percentage of values are missing in biomass_2023 and biomass_2024?

# TODO check the missingness rates of the variables
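Since `isna()` returns booleans, taking the column mean gives the fraction missing directly:

```python
import numpy as np
import pandas as pd

landplots_clean = pd.DataFrame({
    "biomass_2023": [410.2, np.nan, 295.0, np.nan],
    "biomass_2024": [np.nan, 402.1, 310.4, 318.0],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = landplots_clean.isna().mean() * 100
print(missing_pct)  # biomass_2023: 50.0, biomass_2024: 25.0
```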

Part 4: Wide vs long data formats#

Currently, our landplots_clean dataframe violates the tidy data principle of “one row = one observation”: each plot's two biomass measurements are spread across two columns.

We can use pd.melt() to convert the dataframe from wide to long format. This pandas function “unpivots” (or “melts”) the dataframe, turning multiple columns into a single column of values, and the column names into a new column with our choice of name. Specifically, it takes the following arguments:

  • id_vars: columns to keep as identifier variables

  • value_vars: columns to unpivot

  • var_name: name of the new variable column

  • value_name: name of the new value column

# TODO melt the dataframe from wide to long format
landplots_long = None

# TODO clean the year column to be numeric (e.g., 2023, 2024)
#landplots_long["year"] = landplots_long["year"].apply(lambda x: x.split("_")[1]).astype(int)
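Putting the melt and the year cleanup together might look like this (toy frame with only `plot_id` and `treated` as identifier columns; the real call would include the other plot-level variables in `id_vars`):

```python
import pandas as pd

landplots_clean = pd.DataFrame({
    "plot_id": [1, 2],
    "treated": [1, 0],
    "biomass_2023": [410.2, 388.5],
    "biomass_2024": [450.0, 390.1],
})

landplots_long = pd.melt(
    landplots_clean,
    id_vars=["plot_id", "treated"],               # identifier columns to keep
    value_vars=["biomass_2023", "biomass_2024"],  # columns to unpivot
    var_name="year",                              # old column names land here
    value_name="biomass",                         # old cell values land here
)

# "biomass_2023" -> 2023
landplots_long["year"] = landplots_long["year"].str.split("_").str[1].astype(int)
print(landplots_long.shape)  # (4, 4)
```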

Now that our data is in long format, we can visualize the average biomass over the two timepoints by treatment group using a sns.lineplot() with:

  • x = year

  • y = biomass

  • hue = treated

# TODO plot a line plot of biomass over time by treatment group

Finally, if we’d like to return to the wide format, we can use pd.pivot() to pivot the dataframe from long to wide format. This function takes the following arguments:

  • index: column(s) to use as the index (rows)

  • columns: variable to use as the columns: one new column for each unique value

  • values: column to use as the values to fill the new dataframe cells

# TODO pivot the dataframe from long to wide format
# index = plot_id
# columns = year
# values = biomass

landplots_wide = None
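A sketch of the pivot, plus an optional step to flatten the year columns back into `biomass_2023`-style names:

```python
import pandas as pd

landplots_long = pd.DataFrame({
    "plot_id": [1, 1, 2, 2],
    "year": [2023, 2024, 2023, 2024],
    "biomass": [410.2, 450.0, 388.5, 390.1],
})

landplots_wide = landplots_long.pivot(
    index="plot_id",   # one row per plot
    columns="year",    # one column per unique year
    values="biomass",  # cell values come from the biomass column
)
print(landplots_wide.shape)  # (2, 2)

# Optional: flatten back to ordinary columns like biomass_2023
landplots_wide.columns = [f"biomass_{y}" for y in landplots_wide.columns]
landplots_wide = landplots_wide.reset_index()
```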