Activity 20: Data wrangling with ecological data#

2025-11-24


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Part 1: Understanding the raw data#

We have a messy table, landplots_raw, which contains land-plot-level data:

  • plot_id: unique identifier for each plot

  • ecosystem_type: type of ecosystem (e.g., grassland, shrubland, forest)

  • trt_fertilizer_group: treatment group

  • biomass_2023: above-ground biomass in 2023 (g/m^2)

  • biomass_2024: above-ground biomass in 2024 (g/m^2)

  • soil_nitrogen_pct: soil nitrogen content prior to fertilization

Let's first inspect the structure of the dataframe:

landplots_raw = pd.read_csv("~/COMSC-341CD/data/plots_raw.csv")

# TODO inspect the rows, columns, and data types
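Once the CSV loads, a quick structural inspection can be sketched as follows (using a small stand-in frame here, since the real plots_raw.csv isn't bundled with this page):

```python
import pandas as pd

# Stand-in for landplots_raw; in the activity this comes from pd.read_csv above
landplots_raw = pd.DataFrame({
    "plot_id": [1, 2, 3],
    "ecosystem_type": ["grassland", "forest", "shrubland"],
    "biomass_2023": [410.2, 388.5, 295.0],
})

print(landplots_raw.shape)   # (number of rows, number of columns)
print(landplots_raw.dtypes)  # data type of each column
print(landplots_raw.head())  # first few rows
landplots_raw.info()         # rows, columns, dtypes, and non-null counts in one call
```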

Units and IDs#

For causal analysis, we need to be clear about what counts as a unit and how it is identified. We’ll use plot_id to check for uniqueness and duplicates. Let’s answer the following questions:

  1. How many duplicated rows are there in landplots_raw?

  2. Which plot_ids, if any, are duplicated?

# TODO print the number of duplicated rows

# TODO print the duplicated plot_ids
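The duplicate checks can be sketched with `duplicated()`, again on a toy frame in place of the real data:

```python
import pandas as pd

# Toy frame in which plot_id 2 appears twice as a fully duplicated row
landplots_raw = pd.DataFrame({
    "plot_id": [1, 2, 2, 3],
    "biomass_2023": [410.2, 388.5, 388.5, 295.0],
})

# Number of fully duplicated rows (the first occurrence is not counted)
n_dupes = landplots_raw.duplicated().sum()
print(n_dupes)  # 1

# plot_ids that appear more than once (keep=False marks every occurrence)
dup_ids = landplots_raw.loc[
    landplots_raw["plot_id"].duplicated(keep=False), "plot_id"
].unique()
print(dup_ids)  # [2]
```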

Then, we can drop the duplicates. Instead of overwriting landplots_raw, we’ll create a new variable landplots_clean to preserve the original data.

landplots_clean = landplots_raw.copy()

# TODO drop duplicates from landplots_clean
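Dropping the duplicates might then look like this (toy data again; `reset_index` is optional but keeps the row index contiguous):

```python
import pandas as pd

landplots_raw = pd.DataFrame({
    "plot_id": [1, 2, 2, 3],
    "biomass_2023": [410.2, 388.5, 388.5, 295.0],
})

landplots_clean = landplots_raw.copy()
# Keep the first occurrence of each duplicated row
landplots_clean = landplots_clean.drop_duplicates().reset_index(drop=True)
print(len(landplots_clean))  # 3
```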

Part 2: Cleaning variables and columns#

Next, we will clean key variables in landplots_clean:

  1. standardize ecosystem_type labels

  2. rename trt_fertilizer_group to treatment_group

  3. standardize treatment_group labels in a new column treatment_group_clean

  4. create a binary indicator in a new column treated

# TODO standardize ecosystem_type labels

# TODO rename trt_fertilizer_group to treatment_group

# TODO standardize treatment_group labels in a new column treatment_group_clean

# TODO treated binary treatment indicator: 1 if nitrogen fertilized, 0 if control
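One way to sketch these four steps follows. Note that the messy label variants below (e.g. `"N-fert"`, `"n_fertilizer"`) are hypothetical; the real values in plots_raw.csv may differ, so the mapping would need to be adapted:

```python
import pandas as pd

# Hypothetical messy labels -- the real CSV's values may differ
landplots_clean = pd.DataFrame({
    "ecosystem_type": ["Grassland ", "FOREST", "shrubland"],
    "trt_fertilizer_group": ["N-fert", "Control", "n_fertilizer"],
})

# 1. standardize ecosystem_type: strip whitespace, lowercase
landplots_clean["ecosystem_type"] = (
    landplots_clean["ecosystem_type"].str.strip().str.lower()
)

# 2. rename trt_fertilizer_group to treatment_group
landplots_clean = landplots_clean.rename(
    columns={"trt_fertilizer_group": "treatment_group"}
)

# 3. map variant spellings onto canonical labels (mapping is illustrative)
label_map = {"N-fert": "nitrogen", "n_fertilizer": "nitrogen", "Control": "control"}
landplots_clean["treatment_group_clean"] = landplots_clean["treatment_group"].map(label_map)

# 4. binary treatment indicator: 1 if nitrogen fertilized, 0 if control
landplots_clean["treated"] = (
    landplots_clean["treatment_group_clean"] == "nitrogen"
).astype(int)
```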

Part 3: Missingness and bad values#

The biomass variables contain special codes and implausible values.

We’ll first recode -99 as NaN in biomass_2023 and biomass_2024:

# TODO replace -99 with NaN in `biomass_2023` and `biomass_2024`
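The recode can be sketched with `replace()` on both columns at once (toy values below):

```python
import numpy as np
import pandas as pd

landplots_clean = pd.DataFrame({
    "biomass_2023": [410.2, -99, 295.0],
    "biomass_2024": [-99, 402.1, 310.4],
})

# -99 is a sentinel for "missing", so recode it as NaN in both columns
cols = ["biomass_2023", "biomass_2024"]
landplots_clean[cols] = landplots_clean[cols].replace(-99, np.nan)
```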

Additionally, there are some implausible (negative) values in biomass_2023. We’ll flag these with an indicator biomass_2023_implausible, and also set those values in biomass_2023 to NaN:

# TODO flag implausible biomass values with an indicator

# TODO set implausible biomass values in `biomass_2023` column to NaN
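A sketch of the flag-then-null pattern, assuming "implausible" means negative biomass (order matters: compute the flag before overwriting the values):

```python
import numpy as np
import pandas as pd

landplots_clean = pd.DataFrame({"biomass_2023": [410.2, -12.5, 295.0]})

# Biomass cannot be negative, so flag those rows first...
landplots_clean["biomass_2023_implausible"] = (
    landplots_clean["biomass_2023"] < 0
).astype(int)

# ...then null out the implausible measurements
landplots_clean.loc[landplots_clean["biomass_2023"] < 0, "biomass_2023"] = np.nan
```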

Next, we’ll check the missingness rates of the variables. What percentage of values are missing in biomass_2023 and biomass_2024?

# TODO check the missingness rates of the variables
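Since `isna()` returns booleans, taking the column mean gives the fraction missing directly:

```python
import numpy as np
import pandas as pd

landplots_clean = pd.DataFrame({
    "biomass_2023": [410.2, np.nan, 295.0, np.nan],
    "biomass_2024": [np.nan, 402.1, 310.4, 318.0],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = landplots_clean.isna().mean() * 100
print(missing_pct)  # biomass_2023: 50.0, biomass_2024: 25.0
```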

Part 4: Wide vs long data formats#

Currently, our landplots_clean dataframe violates the tidy data principle of “one row = one observation”: each plot's two biomass measurements are spread across two columns.

We can use pd.melt() to convert the dataframe from wide to long format. This pandas function “unpivots” (or “melts”) the dataframe, turning multiple columns into a single column of values, and the column names into a new column with our choice of name. Specifically, it takes the following arguments:

  • id_vars: columns to keep as identifier variables

  • value_vars: columns to unpivot

  • var_name: name of the new variable column

  • value_name: name of the new value column

# TODO melt the dataframe from wide to long format
landplots_long = None

# TODO clean the year column to be numeric (e.g., 2023, 2024)
#landplots_long["year"] = landplots_long["year"].apply(lambda x: x.split("_")[1]).astype(int)
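Putting the melt and the year cleanup together might look like this (toy frame with only `plot_id` and `treated` as identifier columns; the real call would include the other plot-level variables in `id_vars`):

```python
import pandas as pd

landplots_clean = pd.DataFrame({
    "plot_id": [1, 2],
    "treated": [1, 0],
    "biomass_2023": [410.2, 388.5],
    "biomass_2024": [450.0, 390.1],
})

landplots_long = pd.melt(
    landplots_clean,
    id_vars=["plot_id", "treated"],               # identifier columns to keep
    value_vars=["biomass_2023", "biomass_2024"],  # columns to unpivot
    var_name="year",                              # old column names land here
    value_name="biomass",                         # old cell values land here
)

# "biomass_2023" -> 2023
landplots_long["year"] = landplots_long["year"].str.split("_").str[1].astype(int)
print(landplots_long.shape)  # (4, 4)
```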

Now that our data is in long format, we can visualize the average biomass over the two timepoints by treatment group using a sns.lineplot() with:

  • x = year

  • y = biomass

  • hue = treated

# TODO plot a line plot of biomass over time by treatment group

Finally, if we’d like to return to the wide format, we can use pd.pivot() to pivot the dataframe from long to wide format. This function takes the following arguments:

  • index: column(s) to use as the index (rows)

  • columns: variable to use as the columns: one new column for each unique value

  • values: column to use as the values to fill the new dataframe cells

# TODO pivot the dataframe from long to wide format
# index = plot_id
# columns = year
# values = biomass

landplots_wide = None
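A sketch of the pivot, plus an optional step to flatten the year columns back into `biomass_2023`-style names:

```python
import pandas as pd

landplots_long = pd.DataFrame({
    "plot_id": [1, 1, 2, 2],
    "year": [2023, 2024, 2023, 2024],
    "biomass": [410.2, 450.0, 388.5, 390.1],
})

landplots_wide = landplots_long.pivot(
    index="plot_id",   # one row per plot
    columns="year",    # one column per unique year
    values="biomass",  # cell values come from the biomass column
)
print(landplots_wide.shape)  # (2, 2)

# Optional: flatten back to ordinary columns like biomass_2023
landplots_wide.columns = [f"biomass_{y}" for y in landplots_wide.columns]
landplots_wide = landplots_wide.reset_index()
```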