A report on my 5th-place finish in the Water Supply Forecast Rodeo, an ML competition on predicting seasonal water supply in the western US.
Author

Nima Sarang

Published

October 15, 2024

1 The Competition

I participated in the Water Supply Forecast Rodeo, an ML competition with the objective of forecasting water supply in the western United States. I made it to the third and final round, where our models were evaluated in real time on unseen data from January to June 2024. I finished in 5th place, and overall it was a great experience. The competition had other components, such as writing up and presenting the implementation details, but I focused only on the forecasting itself.

One of the biggest challenges for me was the sheer variety of datasets and formats, which was overwhelming. On one hand, it was valuable to learn about different climate and hydrology datasets and to spend time on pre-processing, feature engineering, and feature selection. On the other hand, I would’ve preferred to spend more time on the actual modeling. Interestingly enough, the model I ended up submitting was one I had created as a baseline to compare against the models I was planning to build. My intended final model, a neural network, wasn’t ready before the deadline. I was surprised that the baseline scored as well as it did.

I put all of my development work on GitHub here. The latest commit also includes incomplete work that didn’t make it into the final submission. To see the codebase as it was when I created the baseline model, run git checkout v1.0.0.

If you’ve read this far, let me introduce the problem and the datasets I used.

2 Problem Statement

The challenge was pretty straightforward on paper: predict seasonal water supply at 26 sites across the western United States. Specifically, we needed to forecast the cumulative streamflow volume (measured in thousand acre-feet) for the typical forecast season of April through July. The tricky part was that we had to issue these forecasts multiple times throughout the year - on the 1st, 8th, 15th, and 22nd of each month from January all the way through July.
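
As a minimal sketch of the schedule described above, the issue dates for one year can be enumerated as follows. The helper name `issue_dates` is hypothetical and is not taken from the competition codebase.

```python
from datetime import date


def issue_dates(year: int) -> list[date]:
    """Forecast issue dates: the 1st, 8th, 15th, and 22nd of January through July."""
    return [
        date(year, month, day)
        for month in range(1, 8)      # January (1) through July (7)
        for day in (1, 8, 15, 22)
    ]


dates = issue_dates(2024)
print(len(dates))           # 7 months x 4 dates per month = 28
print(dates[0], dates[-1])  # 2024-01-01 2024-07-22
```

Each site therefore gets 28 forecasts per year, all targeting the same April through July volume.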

What made this interesting was that we weren’t just predicting a single number. Instead, we had to provide quantile forecasts - the 0.10, 0.50 (median), and 0.90 quantiles, basically giving a range of possible outcomes. The streamflow data we were predicting represented “natural flow” - essentially what the water flow would’ve been without any human interference like dams or diversions.
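
To make the quantile idea concrete, here is a minimal sketch of the pinball (quantile) loss, a standard way to score quantile forecasts. The function names and the example numbers are hypothetical, and this is an illustration rather than a statement of the competition's official metric.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss of one forecast y_pred for quantile level q (0 < q < 1)."""
    diff = y_true - y_pred
    # Under-forecasts (diff >= 0) are weighted by q, over-forecasts by (1 - q).
    return q * diff if diff >= 0 else (q - 1) * diff


def mean_pinball_loss(y_true: float, preds: dict[float, float]) -> float:
    """Average the pinball loss over all quantile levels in preds."""
    return sum(pinball_loss(y_true, p, q) for q, p in preds.items()) / len(preds)


# Hypothetical example: observed seasonal volume of 100 thousand acre-feet.
forecast = {0.10: 80.0, 0.50: 95.0, 0.90: 130.0}
print(round(mean_pinball_loss(100.0, forecast), 3))
```

The asymmetric weights are what make each forecast behave like a quantile: a 0.90 forecast is penalized far more for falling below the observed value than for exceeding it, so it is pushed toward the upper end of the plausible range.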

Figure 1: Streamflow volume over time at Colville R at Kettle Falls.

The following interactive map shows the locations and boundaries of the river basins for all 26 sites. I’ve also included supplementary NRCS sites to provide a broader context of the monitoring network and its geographical distribution in the region.

Figure 2: Interactive map of the 26 competition basins (blue polygons) and supplementary NRCS monitoring sites (red circles). Use the layer control in the top right to switch between different basemaps.

3 Datasets

The competition provided access to a ridiculous number of data sources. Here’s what was available:

Streamflow & Antecedent Data:

Snowpack Data:

  • NRCS SNOTEL - automated snow telemetry from 900+ high-elevation sites
  • CDEC Snow Sensor Network - California’s snowpack monitoring stations
  • SNODAS - daily 1 km snow data assimilation system covering the continental US
  • UA/SWANN - University of Arizona’s neural network-based snow water equivalent data
  • MODIS Snow Cover - satellite-based snow cover data at 500 m resolution

Weather & Climate:

Drought & Moisture:

Climate Indices:

Vegetation & Land:

3.1 Examples

Let’s look into some of the datasets. It took me a while to create the visualizations when I was working on the competition, so I thought it would be nice to share them here.

Figure 3: Animation of the Daily Mean Palmer Drought Severity Index (PDSI) for the 2022 water year (October 2021 - July 2022). Red areas indicate drier conditions, while blue areas represent wetter conditions.
Figure 4: Snapshot of groundwater storage across the United States on March 24, 2014, from GRACE satellite data. Red areas indicate higher-than-average groundwater levels, while blue areas show lower levels.

Next is a hydrological analysis of a sample basin using the Copernicus Digital Elevation Model (DEM). These visualizations show how raw elevation data is processed to model water flow paths, which is a key step in understanding watershed behavior for streamflow forecasting.

Figure 5: Original digital elevation model with hillshade overlay
Figure 6: Depression-filled DEM using RichDEM - Elevation data with small pits and sinks removed
Figure 7: D8 flow accumulation on depression-filled DEM - Predicted water flow patterns where brighter areas indicate higher flow accumulation.
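
To illustrate what D8 flow accumulation computes, here is a small pure-Python sketch on a toy grid. It is not RichDEM's implementation. The function name and the 3x3 elevation grid are hypothetical, and the sketch assumes the DEM has already been depression-filled so that only the outlet lacks a lower neighbour.

```python
import math

# Row/column offsets of the 8 neighbours of a cell.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def d8_flow_accumulation(dem):
    """Return, for each cell, the number of cells draining through it (itself included)."""
    rows, cols = len(dem), len(dem[0])

    # Step 1: each cell drains to its steepest strictly-lower neighbour, if any.
    receiver = {}
    for r in range(rows):
        for c in range(cols):
            best_slope, best = 0.0, None
            for dr, dc in NEIGHBOURS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Diagonal neighbours are sqrt(2) cell widths away.
                    slope = (dem[r][c] - dem[nr][nc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best_slope, best = slope, (nr, nc)
            receiver[(r, c)] = best

    # Step 2: visit cells from highest to lowest, passing flow downstream.
    acc = [[1] * cols for _ in range(rows)]
    for r, c in sorted(receiver, key=lambda rc: dem[rc[0]][rc[1]], reverse=True):
        target = receiver[(r, c)]
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc


# Toy elevation grid sloping toward the bottom-right corner (the outlet).
dem = [
    [9, 8, 7],
    [8, 6, 5],
    [7, 5, 1],
]
acc = d8_flow_accumulation(dem)
for row in acc:
    print(row)
# The outlet cell accumulates all 9 cells: acc[2][2] == 9
```

Cells with high accumulation trace out the channel network, which is why the brightest pixels in a flow accumulation map follow the streams.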

Citation

BibTeX citation:
@online{sarang2024,
  author = {Sarang, Nima},
  title = {Water {Supply} {Forecast} {Rodeo}},
  date = {2024-10-15},
  url = {https://www.nimasarang.com/project/2024-10-15-water-supply-forecast-rodeo/},
  langid = {en}
}
For attribution, please cite this work as:
Sarang, Nima. 2024. “Water Supply Forecast Rodeo.” October 15, 2024. https://www.nimasarang.com/project/2024-10-15-water-supply-forecast-rodeo/.