Water Supply Forecast Rodeo

Nima Sarang

1 The Competition

I participated in the Water Supply Forecast Rodeo, an ML competition to forecast water supply in the western United States. I made it to the third (final) round, where our models were evaluated in real time on unseen data from January to June 2024, and I finished in 5th place. Overall it was a great experience. The competition had other components, such as writing up and presenting the implementation details, but I focused only on the forecasting.

One of the biggest challenges for me was the overwhelming variety of datasets and formats. On one hand, it was valuable to learn about different climate and hydrology datasets and to spend time on pre-processing, feature engineering, and feature selection. On the other hand, I would've preferred to spend more time on the actual modeling. Interestingly enough, the model I ended up submitting was one I had created as a baseline to compare against the models I was planning to build. My intended final model, a neural network, wasn't ready before the deadline. I was surprised that the baseline achieved such a good score.

I put all of my development work on GitHub. The latest commit also includes incomplete work that didn't make it into the final submission. To see the codebase as it was when I created the baseline model, run git checkout v1.0.0.

If you've read this far, the rest of this post introduces the problem and the datasets I used.

2 Problem Statement

The challenge was pretty straightforward on paper: predict seasonal water supply at 26 sites across the western United States. Specifically, we needed to forecast the cumulative streamflow volume (measured in thousand acre-feet) for the typical forecast season of April through July. The tricky part was that we had to issue these forecasts multiple times throughout the year - on the 1st, 8th, 15th, and 22nd of each month from January all the way through July.
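To make the schedule concrete, here is a minimal sketch that enumerates the issue dates for one forecast year. The function name issue_dates is my own for illustration and is not taken from the competition code.

```python
from datetime import date

def issue_dates(year: int) -> list:
    """Issue dates: the 1st, 8th, 15th, and 22nd of January through July."""
    return [
        date(year, month, day)
        for month in range(1, 8)          # January (1) through July (7)
        for day in (1, 8, 15, 22)
    ]

dates = issue_dates(2024)
print(len(dates))            # 7 months x 4 dates = 28
print(dates[0], dates[-1])   # 2024-01-01 2024-07-22
```

That works out to 28 forecasts per site per year, each targeting the same April through July volume.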

What made this interesting was that we weren’t just predicting a single number. Instead, we had to provide quantile forecasts - the 0.10, 0.50 (median), and 0.90 quantiles, basically giving a range of possible outcomes. The streamflow data we were predicting represented “natural flow” - essentially what the water flow would’ve been without any human interference like dams or diversions.
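A standard way to score quantile forecasts like these is the pinball (quantile) loss, averaged over the quantile levels. The sketch below is only an illustration of that idea: the function names and the example numbers are hypothetical, and it is not necessarily the exact scoring formula the competition used.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for one observation at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1) * diff)

def mean_quantile_loss(y_true: float, forecasts: dict) -> float:
    """Average pinball loss over all quantile levels in `forecasts` (q -> prediction)."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in forecasts.items()]
    return sum(losses) / len(losses)

# Hypothetical forecast (thousand acre-feet) and observed seasonal volume.
forecast = {0.10: 80.0, 0.50: 100.0, 0.90: 130.0}
observed = 120.0
print(round(mean_quantile_loss(observed, forecast), 3))
```

The asymmetry is the point: the 0.90 forecast is penalized lightly for being too high but heavily for being too low, which pushes it toward the upper end of plausible outcomes.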

Figure 1: Streamflow volume over time at Colville R at Kettle Falls.

The following interactive map shows the locations and boundaries of the river basins for all 26 sites. I’ve also included supplementary NRCS sites to provide a broader context of the monitoring network and its geographical distribution in the region.

Figure 2: Interactive map of the 26 competition basins (blue polygons) and supplementary NRCS monitoring sites (red circles). Use the layer control in the top right to switch between different basemaps.

3 Datasets

The competition provided access to a ridiculous number of data sources. Here's what was available:

Streamflow & Antecedent Data:

NRCS/RFCs monthly naturalized flow - historical monthly naturalized flow at forecast sites
USGS streamflow - daily observed streamflow measurements from USGS streamgages
USBR reservoir inflow - inflow data into Bureau of Reclamation reservoirs
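Daily streamflow is typically reported in cubic feet per second (cfs), while the forecast target is a seasonal volume in thousand acre-feet (KAF). As a rough sketch of how the two relate, the snippet below sums daily mean flow over April through July and converts units. The function name and the constant-flow example are hypothetical, and observed flow is not the same as the naturalized flow being forecast, so this only illustrates the unit arithmetic.

```python
from datetime import date, timedelta

# 1 cfs sustained for one day is 86,400 ft^3; 1 acre-foot is 43,560 ft^3.
ACRE_FEET_PER_CFS_DAY = 86_400 / 43_560

def seasonal_volume_kaf(daily_cfs: dict, year: int) -> float:
    """Sum daily mean flow (cfs) over April 1 to July 31 and convert to KAF."""
    day, end = date(year, 4, 1), date(year, 7, 31)
    total_cfs_days = 0.0
    while day <= end:
        total_cfs_days += daily_cfs[day]
        day += timedelta(days=1)
    return total_cfs_days * ACRE_FEET_PER_CFS_DAY / 1_000

# Hypothetical example: a constant 1,000 cfs for the whole 2023 season (122 days).
flows = {date(2023, 4, 1) + timedelta(days=i): 1_000.0 for i in range(122)}
print(round(seasonal_volume_kaf(flows, 2023), 1))  # about 242 KAF
```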

Snowpack Data:

NRCS SNOTEL - automated snow telemetry from 900+ high-elevation sites
CDEC Snow Sensor Network - California’s snowpack monitoring stations
SNODAS - daily 1km snow data assimilation system covering continental US
UA/SWANN - University of Arizona’s neural network-based snow water equivalent data
MODIS Snow Cover - satellite-based snow cover data at 500m resolution

Weather & Climate:

RCC-ACIS - historical and near real-time climate observations
CPC Seasonal Outlooks - seasonal temperature and precipitation forecasts
Copernicus seasonal forecasts - multi-system seasonal meteorological forecasts
ERA5-Land reanalysis - global reanalysis of land variables
NLDAS-2 forcing data - meteorological variables from North American Land Data Assimilation System
NCEP/NCAR Reanalysis - gridded atmospheric and land variables
USGS SSEBop Evapotranspiration - satellite-based evapotranspiration estimates

Drought & Moisture:

Palmer Drought Severity Index - gridded drought index from gridMET data
GRACE soil moisture - satellite-based soil moisture and groundwater indicators

Climate Indices:

Oceanic Niño Index - El Niño/La Niña indicator from sea surface temperatures
Southern Oscillation Index - pressure-based ENSO indicator
Pacific Decadal Oscillation - longer-term Pacific climate pattern
Madden-Julian Oscillation - tropical weather pattern indices
Pacific North American Index - large-scale atmospheric circulation pattern

Vegetation & Land:

MODIS Vegetation Indices - satellite-based vegetation measurements
Copernicus DEM - 90m resolution digital elevation model
NLCD Urban Imperviousness - water-resistant surface measurements
BasinATLAS - comprehensive basin attributes database

3.1 Examples

Let’s look into some of the datasets. It took me a while to create the visualizations when I was working on the competition, so I thought it would be nice to share them here.

Figure 3: Animation of the Daily Mean Palmer Drought Severity Index (PDSI) for the 2022 water year (October 2021 - July 2022). Red areas indicate drier conditions, while blue areas represent wetter conditions.

Figure 4: Snapshot of groundwater storage across the United States on March 24, 2014, from GRACE satellite data. Red areas indicate higher-than-average groundwater levels, while blue areas show lower levels.

Next is a hydrological analysis of a sample basin using the Copernicus Digital Elevation Model (DEM). These visualizations show how raw elevation data is processed to model water flow paths, which is a key step in understanding watershed behavior for streamflow forecasting.

Figure 5: Original digital elevation model with hillshade overlay

Figure 6: Depression-filled DEM using RichDEM - Elevation data with small pits and sinks removed

Figure 7: D8 flow accumulation on depression-filled DEM - Predicted water flow patterns where brighter areas indicate higher flow accumulation.
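To illustrate what D8 flow accumulation computes, here is a minimal pure-Python sketch on a tiny grid. It is not the RichDEM implementation used for the figures: it assumes the DEM is already depression-free, treats cells as square with unit spacing, and the function name and toy elevations are my own.

```python
import math

# Offsets (row, col) to the 8 neighbouring cells.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def d8_flow_accumulation(dem):
    """Count, for each cell, how many cells (itself included) drain through it."""
    rows, cols = len(dem), len(dem[0])
    acc = [[1] * cols for _ in range(rows)]  # every cell contributes itself
    # Visit cells from highest to lowest so a cell's total is final
    # before it is passed to its downstream neighbour.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: dem[rc[0]][rc[1]],
        reverse=True,
    )
    for r, c in order:
        best_slope, target = 0.0, None
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Drop divided by distance (1 for edges, sqrt(2) for diagonals).
                slope = (dem[r][c] - dem[nr][nc]) / math.hypot(dr, dc)
                if slope > best_slope:
                    best_slope, target = slope, (nr, nc)
        if target is not None:  # cells with no lower neighbour act as outlets
            acc[target[0]][target[1]] += acc[r][c]
    return acc

# Toy DEM sloping toward the bottom-right corner.
dem = [
    [9, 8, 7],
    [8, 6, 5],
    [7, 5, 1],
]
for row in d8_flow_accumulation(dem):
    print(row)
# [1, 1, 1]
# [1, 2, 3]
# [1, 3, 9]
```

Each cell sends all of its flow to its single steepest downslope neighbour, so the outlet in the bottom-right corner ends up with all nine cells. This is also why depression filling comes first: an unfilled pit would act as a false outlet and trap the flow.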

Citation

BibTeX citation:

@online{sarang2024,
  author = {Sarang, Nima},
  title = {Water {Supply} {Forecast} {Rodeo}},
  date = {2024-10-15},
  url = {https://www.nimasarang.com/project/2024-10-15-water-supply-forecast-rodeo/},
  langid = {en}
}

For attribution, please cite this work as:

Sarang, Nima. 2024. “Water Supply Forecast Rodeo.” October 15, 2024. https://www.nimasarang.com/project/2024-10-15-water-supply-forecast-rodeo/.