Activity 20 Solution: Data wrangling with ecological data#
2025-11-24
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Part 1: Understanding the raw data#
We have a messy table, landplots_raw, which contains land-plot-level data:

- plot_id: unique identifier for each plot
- site_id: identifier for the site a plot belongs to
- ecosystem_type: type of ecosystem (e.g., grassland, shrubland, forest)
- trt_fertilizer_group: treatment group
- biomass_2023: above-ground biomass in 2023 (g/m^2)
- biomass_2024: above-ground biomass in 2024 (g/m^2)
- soil_nitrogen_pct: soil nitrogen content prior to fertilization
Let's first inspect the structure of the dataframe:
landplots_raw = pd.read_csv("~/COMSC-341CD/data/plots_raw.csv")
landplots_raw.dtypes
# TODO inspect the rows, columns, and data types
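Beyond .dtypes, a quick way to cover the rows, columns, and data types in one place is the following sketch (standard pandas calls):

# sketch: inspect shape, dtypes, and the first few rows
print(landplots_raw.shape)
print(landplots_raw.dtypes)
landplots_raw.head()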
Units and IDs#
For causal analysis, we need to be clear about what counts as a unit and how it is identified. We’ll use plot_id to check for uniqueness and duplicates. Let’s answer the following questions:

- How many duplicated rows are there in landplots_raw?
- Which plot_ids, if any, are duplicated?
# TODO print the number of duplicated rows
# every row beyond one per unique plot_id is a duplicate
print(landplots_raw.shape[0] - landplots_raw['plot_id'].nunique())

# TODO print the duplicated plot_ids
# duplicated() marks each occurrence after the first
print(landplots_raw[landplots_raw.duplicated(subset=['plot_id'])]['plot_id'])

# inspect both copies of one duplicated plot_id
landplots_raw[landplots_raw['plot_id'] == 501]
5
250 501
251 532
252 619
253 604
254 688
Name: plot_id, dtype: int64
|   | plot_id | site_id | ecosystem_type | biomass_2023 | biomass_2024 | trt_fertilizer_group | soil_nitrogen_pct |
|---|---|---|---|---|---|---|---|
| 1 | 501 | S1 | forest | 346.1 | 349.1 | fert | 0.452 |
| 250 | 501 | S1 | forest | 346.1 | 349.1 | fert | 0.452 |
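As an aside, duplicated() takes a keep argument: keep=False marks every copy of a duplicated plot_id rather than all-but-the-first, which makes it easy to view the duplicates side by side (a sketch):

# view all copies of every duplicated plot_id at once
landplots_raw[landplots_raw.duplicated(subset=['plot_id'], keep=False)].sort_values('plot_id')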
Then, we can drop the duplicates. Instead of overwriting landplots_raw, we’ll create a new variable landplots_clean to preserve the original data.
landplots_clean = landplots_raw.copy()

# TODO drop duplicates from landplots_clean
landplots_clean = landplots_clean.drop_duplicates(subset=['plot_id'])

# verify: the count of duplicated plot_ids should now be 0
landplots_clean.shape[0] - landplots_clean['plot_id'].nunique()
0
Part 2: Cleaning variables and columns#
Next, we will clean key variables in landplots_clean:

- standardize ecosystem_type labels
- rename trt_fertilizer_group to treat_group
- standardize treat_group labels in a new column treat_group_clean
- create a binary indicator in a new column treated
# TODO standardize ecosystem_type labels
landplots_clean['ecosystem_type'] = landplots_clean['ecosystem_type'].str.lower().str.strip()
print(landplots_clean['ecosystem_type'].unique())
# TODO rename trt_fertilizer_group to treat_group
landplots_clean = landplots_clean.rename(
columns={'trt_fertilizer_group': 'treat_group'}
)
['forest' 'grassland']
landplots_clean.columns
Index(['plot_id', 'site_id', 'ecosystem_type', 'biomass_2023', 'biomass_2024',
'treat_group', 'soil_nitrogen_pct'],
dtype='object')
# TODO standardize treat_group labels in a new column treat_group_clean
print(landplots_clean['treat_group'].unique())
landplots_clean['treat_group_clean'] = landplots_clean['treat_group'].replace(
{'T': 'fert',
'C': 'control'}
)
print(landplots_clean['treat_group_clean'].unique())
# TODO treated binary treatment indicator: 1 if nitrogen fertilized, 0 if control
landplots_clean['treated'] = np.where(landplots_clean['treat_group_clean'] == 'fert', 1, 0)
['fert' 'C' 'T' 'control']
['fert' 'control']
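A design note on .replace versus .map: .replace only touches the keys listed in the dictionary, so rows already labeled 'fert' or 'control' pass through unchanged. With .map, any label missing from the dictionary would become NaN, so the full mapping has to be spelled out (a sketch):

# .map requires every valid label in the mapping, or unmapped rows become NaN
landplots_clean['treat_group'].map(
    {'T': 'fert', 'C': 'control', 'fert': 'fert', 'control': 'control'}
).unique()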
landplots_clean.head()
|   | plot_id | site_id | ecosystem_type | biomass_2023 | biomass_2024 | treat_group | soil_nitrogen_pct | treat_group_clean | treated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 500 | S4 | forest | 245.5 | 185.5 | fert | 0.300 | fert | 1 |
| 1 | 501 | S1 | forest | 346.1 | 349.1 | fert | 0.452 | fert | 1 |
| 2 | 502 | S1 | grassland | 604.1 | 659.3 | C | 0.486 | control | 0 |
| 3 | 503 | S1 | forest | 452.0 | 431.5 | T | 0.417 | fert | 1 |
| 4 | 504 | S2 | forest | 329.5 | 402.2 | fert | 0.515 | fert | 1 |
# demonstrate what np.where returns: a 0/1 array aligned with the rows
print(np.where(landplots_clean['treat_group_clean'] == 'fert', 1, 0)[:5])
landplots_clean.head()
[1 1 0 1 1]
|   | plot_id | site_id | ecosystem_type | biomass_2023 | biomass_2024 | treat_group | soil_nitrogen_pct | treat_group_clean | treated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 500 | S4 | forest | 245.5 | 185.5 | fert | 0.300 | fert | 1 |
| 1 | 501 | S1 | forest | 346.1 | 349.1 | fert | 0.452 | fert | 1 |
| 2 | 502 | S1 | grassland | 604.1 | 659.3 | C | 0.486 | control | 0 |
| 3 | 503 | S1 | forest | 452.0 | 431.5 | T | 0.417 | fert | 1 |
| 4 | 504 | S2 | forest | 329.5 | 402.2 | fert | 0.515 | fert | 1 |
Part 3: Missingness and bad values#
The biomass variables contain special codes and implausible values.
We’ll first recode the special codes -99 and -50 as NaN in biomass_2023 and biomass_2024:
# TODO replace special codes with NaN in `biomass_2023` and `biomass_2024`
print(landplots_clean['biomass_2023'].value_counts())

landplots_clean['biomass_2023'] = landplots_clean['biomass_2023'].replace({
    -99: np.nan,
    -50: np.nan
})
# apply the same recode to the 2024 column
landplots_clean['biomass_2024'] = landplots_clean['biomass_2024'].replace({
    -99: np.nan,
    -50: np.nan
})

print(landplots_clean['biomass_2023'].value_counts())
biomass_2023
-99.0 5
-50.0 5
449.6 2
349.7 2
480.0 2
..
356.2 1
346.0 1
277.8 1
359.7 1
367.4 1
Name: count, Length: 232, dtype: int64
biomass_2023
498.0 2
428.9 2
349.7 2
451.4 2
367.9 2
..
356.2 1
346.0 1
277.8 1
359.7 1
367.4 1
Name: count, Length: 230, dtype: int64
Additionally, there are some implausible (negative) values in biomass_2023. We’ll flag these with an indicator
biomass_2023_implausible, and also set those values in biomass_2023 to NaN:
# TODO flag implausible biomass values with an indicator
landplots_clean['biomass_2023_implausible'] = (landplots_clean['biomass_2023'] < 0).astype(int)
# TODO set implausible biomass values in `biomass_2023` column to NaN
landplots_clean.loc[landplots_clean['biomass_2023_implausible'] == 1, 'biomass_2023'] = np.nan
Next, we’ll check the missingness rates of the variables. What percentage of values are missing in biomass_2023 and biomass_2024?
# TODO check the missingness rates of the variables
landplots_clean['biomass_2023'].isna().mean()
np.float64(0.04)
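So 4% of the biomass_2023 values are missing. To answer the question for both variables at once, the same check can be applied column-wise (a sketch):

# fraction of missing values in each biomass column
landplots_clean[['biomass_2023', 'biomass_2024']].isna().mean()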
Part 4: Wide vs long data formats#
Currently, our landplots_clean dataframe violates the tidy data principle of “one row = one observation”: each row contains two biomass measurements, one per year.
We can use pd.melt() to convert the dataframe from wide to long format. This pandas function “unpivots” (or “melts”) the dataframe, turning multiple columns into a single column of values, and the column names into a new column with our choice of name. Specifically, it takes the following arguments:
- id_vars: columns to keep as identifier variables
- value_vars: columns to unpivot
- var_name: name of the new variable column
- value_name: name of the new value column
display(landplots_clean.head())
# TODO melt the dataframe from wide to long format
landplots_long = pd.melt(landplots_clean,
id_vars=["plot_id", "treated","ecosystem_type"],
value_vars=["biomass_2023", "biomass_2024"],
var_name="year",
value_name="biomass")
# TODO clean the year column to be numeric (e.g., 2023, 2024)
landplots_long["year"] = landplots_long["year"].apply(lambda x: x.split("_")[1]).astype(int)
display(landplots_long)
print(landplots_long.shape)
|   | plot_id | site_id | ecosystem_type | biomass_2023 | biomass_2024 | treat_group | soil_nitrogen_pct | treat_group_clean | treated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 500 | S4 | forest | 245.5 | 185.5 | fert | 0.300 | fert | 1 |
| 1 | 501 | S1 | forest | 346.1 | 349.1 | fert | 0.452 | fert | 1 |
| 2 | 502 | S1 | grassland | 604.1 | 659.3 | C | 0.486 | control | 0 |
| 3 | 503 | S1 | forest | 452.0 | 431.5 | T | 0.417 | fert | 1 |
| 4 | 504 | S2 | forest | 329.5 | 402.2 | fert | 0.515 | fert | 1 |
|   | plot_id | treated | ecosystem_type | year | biomass |
|---|---|---|---|---|---|
| 0 | 500 | 1 | forest | 2023 | 245.5 |
| 1 | 501 | 1 | forest | 2023 | 346.1 |
| 2 | 502 | 0 | grassland | 2023 | 604.1 |
| 3 | 503 | 1 | forest | 2023 | 452.0 |
| 4 | 504 | 1 | forest | 2023 | 329.5 |
| ... | ... | ... | ... | ... | ... |
| 495 | 745 | 1 | grassland | 2024 | 467.9 |
| 496 | 746 | 0 | grassland | 2024 | 519.7 |
| 497 | 747 | 1 | grassland | 2024 | 335.3 |
| 498 | 748 | 0 | grassland | 2024 | 449.1 |
| 499 | 749 | 1 | grassland | 2024 | 430.9 |
500 rows × 5 columns
(500, 5)
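As a side note, the year cleanup above can also be written with pandas' vectorized string accessor instead of apply, which is the more idiomatic choice for column-wise string operations (an equivalent sketch, run in place of the apply version):

# vectorized equivalent of the apply-based year cleanup
landplots_long["year"] = landplots_long["year"].str.split("_").str[1].astype(int)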
Now that our data is in long format, we can visualize the average biomass over the two timepoints by treatment group using a sns.lineplot() with:
- x = year
- y = biomass
- hue = treated
# TODO plot a line plot of biomass over time by treatment group
sns.lineplot(x='year', y='biomass', hue='treated', data=landplots_long)
<Axes: xlabel='year', ylabel='biomass'>
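By default, sns.lineplot aggregates repeated observations at each x value with their mean and draws a confidence band around it, so the plotted lines can be sanity-checked against a direct groupby (a sketch):

# compute the group means that the line plot displays
landplots_long.groupby(['year', 'treated'])['biomass'].mean().reset_index()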
Finally, if we’d like to return to the wide format, we can use pd.pivot() to pivot the dataframe from long to wide format. This function takes the following arguments:
- index: column(s) to use as the index (rows)
- columns: variable whose unique values each become a new column
- values: column to use as the values to fill the new dataframe cells
# TODO pivot the dataframe from long to wide format
# index = plot_id
# columns = year
# values = biomass
landplots_wide = landplots_long.pivot(
index="plot_id",
columns="year",
values="biomass"
)
landplots_wide
| year | 2023 | 2024 |
|---|---|---|
| plot_id |  |  |
| 500 | 245.5 | 185.5 |
| 501 | 346.1 | 349.1 |
| 502 | 604.1 | 659.3 |
| 503 | 452.0 | 431.5 |
| 504 | 329.5 | 402.2 |
| ... | ... | ... |
| 745 | 447.4 | 467.9 |
| 746 | 416.6 | 519.7 |
| 747 | 342.1 | 335.3 |
| 748 | 472.7 | 449.1 |
| 749 | 367.4 | 430.9 |
250 rows × 2 columns
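One payoff of the wide format is that per-plot comparisons across years become simple column arithmetic. For instance, each plot's change in biomass (note that the column labels are the integer years) can be computed as a new column (a sketch):

# per-plot change in biomass from 2023 to 2024
landplots_wide['biomass_change'] = landplots_wide[2024] - landplots_wide[2023]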