Activity 20 Solution: Data wrangling with ecological data#

2025-11-24


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Part 1: Understanding the raw data#

We have a messy table, landplots_raw, which contains land-plot-level data:

  • plot_id: unique identifier for each plot

  • site_id: identifier for the site each plot belongs to

  • ecosystem_type: type of ecosystem (e.g., grassland, shrubland, forest)

  • trt_fertilizer_group: treatment group

  • biomass_2023: above-ground biomass in 2023 (g/m^2)

  • biomass_2024: above-ground biomass in 2024 (g/m^2)

  • soil_nitrogen_pct: soil nitrogen content prior to fertilization

Let's first inspect the structure of the dataframe:

landplots_raw = pd.read_csv("~/COMSC-341CD/data/plots_raw.csv")

landplots_raw.dtypes

# TODO inspect the rows, columns, and data types
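If the course CSV isn't available locally, a small synthetic stand-in with the same columns lets the rest of the activity run. The values below are made up for illustration; only run this cell if the real file is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for plots_raw.csv; column names match the real file,
# but the values are invented (note the deliberately duplicated plot_id 501).
landplots_raw = pd.DataFrame({
    "plot_id": [500, 501, 502, 503, 501],
    "site_id": ["S4", "S1", "S1", "S1", "S1"],
    "ecosystem_type": ["Forest ", "forest", "grassland", "forest", "forest"],
    "trt_fertilizer_group": ["fert", "fert", "C", "T", "fert"],
    "biomass_2023": [245.5, 346.1, 604.1, 452.0, 346.1],
    "biomass_2024": [185.5, 349.1, 659.3, 431.5, 349.1],
    "soil_nitrogen_pct": [0.300, 0.452, 0.486, 0.417, 0.452],
})

print(landplots_raw.shape)  # → (5, 7)
print(landplots_raw.dtypes)
```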

Units and IDs#

For causal analysis, we need to be clear about what counts as a unit and how it is identified. We’ll use plot_id to check for uniqueness and duplicates. Let’s answer the following questions:

  1. How many duplicated rows are there in plots_raw?

  2. Which plot_ids, if any, are duplicated?

# TODO print the number of duplicated rows
# (here every duplicate is a full-row copy, so duplicated plot_ids == duplicated rows)
print(landplots_raw.shape[0] - landplots_raw['plot_id'].nunique())

# TODO print the duplicated plot_ids
print(landplots_raw[landplots_raw.duplicated(subset=['plot_id'])]['plot_id'])

landplots_raw[landplots_raw['plot_id'] == 501]
5
250    501
251    532
252    619
253    604
254    688
Name: plot_id, dtype: int64
     plot_id site_id ecosystem_type  biomass_2023  biomass_2024 trt_fertilizer_group  soil_nitrogen_pct
1        501      S1         forest         346.1         349.1                 fert              0.452
250      501      S1         forest         346.1         349.1                 fert              0.452

Then, we can drop the duplicates. Instead of overwriting landplots_raw, we’ll create a new variable landplots_clean to preserve the original data.

landplots_clean = landplots_raw.copy()

# TODO drop duplicates from landplots_clean
landplots_clean = landplots_clean.drop_duplicates(subset=['plot_id'])

landplots_clean.shape[0] - landplots_clean['plot_id'].nunique()

#landplots_clean[landplots_clean['plot_id'] == 501]
0
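As a quick sanity check after deduplication, pandas' Series.is_unique gives a one-line assertion that every plot_id now appears exactly once. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one duplicated plot_id
df = pd.DataFrame({"plot_id": [500, 501, 501],
                   "biomass_2023": [245.5, 346.1, 346.1]})

clean = df.drop_duplicates(subset=["plot_id"])

# is_unique is True only when every plot_id appears exactly once
assert clean["plot_id"].is_unique
print(clean.shape[0])  # → 2
```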

Part 2: Cleaning variables and columns#

Next, we will clean key variables in landplots_clean:

  1. standardize ecosystem_type labels

  2. rename trt_fertilizer_group to treat_group

  3. standardize treat_group labels in a new column treat_group_clean

  4. create a binary indicator in a new column treated

# TODO standardize ecosystem_type labels
landplots_clean['ecosystem_type'] = landplots_clean['ecosystem_type'].str.lower().str.strip()
print(landplots_clean['ecosystem_type'].unique())
['forest' 'grassland']

# TODO rename trt_fertilizer_group to treat_group
landplots_clean = landplots_clean.rename(
    columns={'trt_fertilizer_group': 'treat_group'}
)
landplots_clean.columns
Index(['plot_id', 'site_id', 'ecosystem_type', 'biomass_2023', 'biomass_2024',
       'treat_group', 'soil_nitrogen_pct'],
      dtype='object')
# TODO standardize treat_group labels in a new column treat_group_clean
print(landplots_clean['treat_group'].unique())

landplots_clean['treat_group_clean'] = landplots_clean['treat_group'].replace(
    {'T': 'fert',
     'C': 'control'}
)

print(landplots_clean['treat_group_clean'].unique())

# TODO treated binary treatment indicator: 1 if nitrogen fertilized, 0 if control
landplots_clean['treated'] = np.where(landplots_clean['treat_group_clean'] == 'fert', 1, 0)
['fert' 'C' 'T' 'control']
['fert' 'control']
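A useful way to double-check a label recode like this is to cross-tabulate the raw column against the cleaned one: every raw label should map to exactly one cleaned label. A minimal sketch with made-up labels:

```python
import pandas as pd

# Hypothetical raw labels mimicking the messy treat_group column
raw = pd.Series(["fert", "C", "T", "control", "fert"], name="treat_group")
clean = raw.replace({"T": "fert", "C": "control"}).rename("treat_group_clean")

# Each row (raw label) should have a nonzero count in exactly one column
xtab = pd.crosstab(raw, clean)
print(xtab)
assert (xtab.gt(0).sum(axis=1) == 1).all()
```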
landplots_clean.head()
   plot_id site_id ecosystem_type  biomass_2023  biomass_2024 treat_group  soil_nitrogen_pct treat_group_clean  treated
0      500      S4         forest         245.5         185.5        fert              0.300              fert        1
1      501      S1         forest         346.1         349.1        fert              0.452              fert        1
2      502      S1      grassland         604.1         659.3           C              0.486           control        0
3      503      S1         forest         452.0         431.5           T              0.417              fert        1
4      504      S2         forest         329.5         402.2        fert              0.515              fert        1
print(np.where(landplots_clean['treat_group_clean'] == 'fert', 1, 0)[:5])
landplots_clean.head()
#help(np.where)
[1 1 0 1 1]
   plot_id site_id ecosystem_type  biomass_2023  biomass_2024 treat_group  soil_nitrogen_pct treat_group_clean  treated
0      500      S4         forest         245.5         185.5        fert              0.300              fert        1
1      501      S1         forest         346.1         349.1        fert              0.452              fert        1
2      502      S1      grassland         604.1         659.3           C              0.486           control        0
3      503      S1         forest         452.0         431.5           T              0.417              fert        1
4      504      S2         forest         329.5         402.2        fert              0.515              fert        1

Part 3: Missingness and bad values#

The biomass variables contain special codes and implausible values.

We’ll first recode the sentinel codes -99 and -50 as NaN in biomass_2023 and biomass_2024:

# TODO replace the sentinel codes -99 and -50 with NaN in `biomass_2023` and `biomass_2024`
print(landplots_clean['biomass_2023'].value_counts())

landplots_clean['biomass_2023'] = landplots_clean['biomass_2023'].replace({
    -99: np.nan,
    -50: np.nan
})
landplots_clean['biomass_2024'] = landplots_clean['biomass_2024'].replace({
    -99: np.nan,
    -50: np.nan
})

print(landplots_clean['biomass_2023'].value_counts())
biomass_2023
-99.0     5
-50.0     5
 449.6    2
 349.7    2
 480.0    2
         ..
 356.2    1
 346.0    1
 277.8    1
 359.7    1
 367.4    1
Name: count, Length: 232, dtype: int64
biomass_2023
498.0    2
428.9    2
349.7    2
451.4    2
367.9    2
        ..
356.2    1
346.0    1
277.8    1
359.7    1
367.4    1
Name: count, Length: 230, dtype: int64

Additionally, there are some implausible (negative) values in biomass_2023. We’ll flag these with an indicator biomass_2023_implausible, and also set those values in biomass_2023 to NaN:

# TODO flag implausible biomass values with an indicator
landplots_clean['biomass_2023_implausible'] = (landplots_clean['biomass_2023'] < 0).astype(int)

# TODO set implausible biomass values in `biomass_2023` column to NaN
landplots_clean.loc[landplots_clean['biomass_2023'] < 0, 'biomass_2023'] = np.nan

Next, we’ll check the missingness rates of the variables. What percentage of values are missing in biomass_2023 and biomass_2024?

# TODO check the missingness rates of the variables
landplots_clean['biomass_2023'].isna().mean()
np.float64(0.04)
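The question asks about both years, and applying isna().mean() to both columns at once answers it in one expression. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing biomass values
df = pd.DataFrame({
    "biomass_2023": [245.5, np.nan, 604.1, np.nan],
    "biomass_2024": [185.5, 349.1, np.nan, 431.5],
})

# Fraction of missing values per column
print(df[["biomass_2023", "biomass_2024"]].isna().mean())
# biomass_2023 → 0.5, biomass_2024 → 0.25
```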

Part 4: Wide vs long data formats#

Currently, our landplots_clean dataframe violates the tidy data principle of “one row = one observation”: each row holds two biomass measurements, one per year.

We can use pd.melt() to convert the dataframe from wide to long format. This pandas function “unpivots” (or “melts”) the dataframe, turning multiple columns into a single column of values, and the column names into a new column with our choice of name. Specifically, it takes the following arguments:

  • id_vars: columns to keep as identifier variables

  • value_vars: columns to unpivot

  • var_name: name of the new variable column

  • value_name: name of the new value column

display(landplots_clean.head())

# TODO melt the dataframe from wide to long format
landplots_long = pd.melt(landplots_clean,
                         id_vars=["plot_id", "treated","ecosystem_type"],
                         value_vars=["biomass_2023", "biomass_2024"],
                         var_name="year",
                         value_name="biomass")


# TODO clean the year column to be numeric (e.g., 2023, 2024)
landplots_long["year"] = landplots_long["year"].apply(lambda x: x.split("_")[1]).astype(int)

display(landplots_long)
print(landplots_long.shape)
   plot_id site_id ecosystem_type  biomass_2023  biomass_2024 treat_group  soil_nitrogen_pct treat_group_clean  treated
0      500      S4         forest         245.5         185.5        fert              0.300              fert        1
1      501      S1         forest         346.1         349.1        fert              0.452              fert        1
2      502      S1      grassland         604.1         659.3           C              0.486           control        0
3      503      S1         forest         452.0         431.5           T              0.417              fert        1
4      504      S2         forest         329.5         402.2        fert              0.515              fert        1
     plot_id  treated ecosystem_type  year  biomass
0        500        1         forest  2023    245.5
1        501        1         forest  2023    346.1
2        502        0      grassland  2023    604.1
3        503        1         forest  2023    452.0
4        504        1         forest  2023    329.5
..       ...      ...            ...   ...      ...
495      745        1      grassland  2024    467.9
496      746        0      grassland  2024    519.7
497      747        1      grassland  2024    335.3
498      748        0      grassland  2024    449.1
499      749        1      grassland  2024    430.9

500 rows × 5 columns

(500, 5)

Now that our data is in long format, we can visualize the average biomass over the two timepoints by treatment group using a sns.lineplot() with:

  • x = year

  • y = biomass

  • hue = treated

# TODO plot a line plot of biomass over time by treatment group
sns.lineplot(x='year', y='biomass', hue='treated', data=landplots_long)
<Axes: xlabel='year', ylabel='biomass'>
[Figure: mean biomass in 2023 and 2024, one line per treatment group (hue = treated)]

Finally, if we’d like to return to the wide format, we can use pd.pivot() to pivot the dataframe from long to wide format. This function takes the following arguments:

  • index: column(s) to use as the index (rows)

  • columns: variable to use as the columns: one new column for each unique value

  • values: column to use as the values to fill the new dataframe cells

# TODO pivot the dataframe from long to wide format
# index = plot_id
# columns = year
# values = biomass

landplots_wide = landplots_long.pivot(
    index="plot_id",
    columns="year",
    values="biomass"
)

landplots_wide
year       2023   2024
plot_id
500       245.5  185.5
501       346.1  349.1
502       604.1  659.3
503       452.0  431.5
504       329.5  402.2
...         ...    ...
745       447.4  467.9
746       416.6  519.7
747       342.1  335.3
748       472.7  449.1
749       367.4  430.9

250 rows × 2 columns
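Since melt and pivot are inverses of each other here, a round-trip check is a good way to confirm that no biomass values were lost or reordered along the way. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical wide frame with one biomass column per year
wide = pd.DataFrame({
    "plot_id": [500, 501],
    "biomass_2023": [245.5, 346.1],
    "biomass_2024": [185.5, 349.1],
})

# Wide → long, then clean the year column to be numeric
long = pd.melt(wide, id_vars=["plot_id"],
               value_vars=["biomass_2023", "biomass_2024"],
               var_name="year", value_name="biomass")
long["year"] = long["year"].str.split("_").str[1].astype(int)

# Long → wide, restoring the original column names
back = long.pivot(index="plot_id", columns="year", values="biomass")
back.columns = [f"biomass_{c}" for c in back.columns]

# The round trip should reproduce the original wide frame exactly
assert back.reset_index().equals(wide)
print(back.shape)  # → (2, 2)
```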