1 The Competition
I participated in the Water Supply Forecast Rodeo, an ML competition with the objective of forecasting water supply in the western United States. I made it to the third (final) round, where our models were evaluated in real time on unseen data from January to June 2024. I finished in 5th place, and overall it was a great experience. There were other aspects to the competition, such as writing up and presenting the implementation details, but my only focus was on the forecasting aspect.
One of the biggest challenges for me was the variety of datasets and their different formats, which was overwhelming. On one hand, it was valuable to learn about different climate and hydrology datasets and spend time on pre-processing, feature engineering, and feature selection. On the other hand, I would’ve preferred to spend more time on the actual modeling. Interestingly enough, the model I ended up submitting was something I had created as a baseline to compare against the actual models I was planning to build. My final model, which was a neural network, wasn’t ready before the deadline. I was surprised that the baseline model was able to achieve such a good score.
I put all of my development work on GitHub here. The latest commit also includes incomplete work that didn’t make it into the final submission. If you want to see the codebase as it was when I created the baseline model, you can run git checkout v1.0.0.
If you’ve read this far, let me introduce the problem and the datasets that I used.
2 Problem Statement
The challenge was pretty straightforward on paper: predict seasonal water supply at 26 sites across the western United States. Specifically, we needed to forecast the cumulative streamflow volume (measured in thousand acre-feet) for the typical forecast season of April through July. The tricky part was that we had to issue these forecasts multiple times throughout the year - on the 1st, 8th, 15th, and 22nd of each month from January all the way through July.
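To make the issue schedule concrete, here is a minimal Python sketch that enumerates the forecast issue dates for one year. The function name `issue_dates` is hypothetical and is not taken from the competition code; it only encodes the rule stated above (the 1st, 8th, 15th, and 22nd of each month, January through July).

```python
from datetime import date


def issue_dates(year: int) -> list[date]:
    """Forecast issue dates: the 1st, 8th, 15th, and 22nd of January through July."""
    return [
        date(year, month, day)
        for month in range(1, 8)      # January (1) through July (7)
        for day in (1, 8, 15, 22)     # four issue dates per month
    ]


dates = issue_dates(2024)
print(len(dates), dates[0], dates[-1])  # 28 2024-01-01 2024-07-22
```

That works out to 28 forecasts per site per year, each targeting the same April through July volume.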
What made this interesting was that we weren’t just predicting a single number. Instead, we had to provide quantile forecasts - the 0.10, 0.50 (median), and 0.90 quantiles, basically giving a range of possible outcomes. The streamflow data we were predicting represented “natural flow” - essentially what the water flow would’ve been without any human interference like dams or diversions.
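A common way to score quantile forecasts like these is the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically depending on the quantile level. The sketch below is illustrative only: the function names and the example numbers are made up, and I am not claiming this is the exact metric or implementation used by the competition.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for a single observation at quantile level q (0 < q < 1)."""
    diff = y_true - y_pred
    # Under-prediction (diff >= 0) is weighted by q, over-prediction by (1 - q).
    return q * diff if diff >= 0 else (q - 1) * diff


def mean_pinball_loss(y_true: list[float], y_pred: list[float], q: float) -> float:
    """Average pinball loss over paired observations and forecasts."""
    return sum(pinball_loss(t, p, q) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical example: observed seasonal volume of 100 (thousand acre-feet)
# and one forecast for each of the three required quantiles.
observed = 100.0
forecasts = {0.10: 80.0, 0.50: 95.0, 0.90: 130.0}
for q, pred in forecasts.items():
    print(q, round(pinball_loss(observed, pred, q), 3))
```

With this loss, a 0.90 forecast that sits far above the observation is penalized only lightly per unit of error, while a 0.90 forecast that falls below the observation is penalized heavily, which is what pushes each quantile toward its intended coverage.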
The following interactive map shows the locations and boundaries of the river basins for all 26 sites. I’ve also included supplementary NRCS sites to provide a broader context of the monitoring network and its geographical distribution in the region.
3 Datasets
The competition provided access to a ridiculous number of data sources. Here’s what was available:
Streamflow & Antecedent Data:
- NRCS/RFCs monthly naturalized flow - historical monthly naturalized flow at forecast sites
- USGS streamflow - daily observed streamflow measurements from USGS streamgages
- USBR reservoir inflow - inflow data into Bureau of Reclamation reservoirs
Snowpack Data:
- NRCS SNOTEL - automated snow telemetry from 900+ high-elevation sites
- CDEC Snow Sensor Network - California’s snowpack monitoring stations
- SNODAS - daily 1km snow data assimilation system covering continental US
- UA/SWANN - University of Arizona’s neural network-based snow water equivalent data
- MODIS Snow Cover - satellite-based snow cover data at 500m resolution
Weather & Climate:
- RCC-ACIS - historical and near real-time climate observations
- CPC Seasonal Outlooks - seasonal temperature and precipitation forecasts
- Copernicus seasonal forecasts - multi-system seasonal meteorological forecasts
- ERA5-Land reanalysis - global reanalysis of land variables
- NLDAS-2 forcing data - meteorological variables from North American Land Data Assimilation System
- NCEP/NCAR Reanalysis - gridded atmospheric and land variables
- USGS SSEBop Evapotranspiration - satellite-based evapotranspiration estimates
Drought & Moisture:
- Palmer Drought Severity Index - gridded drought index from gridMET data
- GRACE soil moisture - satellite-based soil moisture and groundwater indicators
Climate Indices:
- Oceanic Niño Index - El Niño/La Niña indicator from sea surface temperatures
- Southern Oscillation Index - pressure-based ENSO indicator
- Pacific Decadal Oscillation - longer-term Pacific climate pattern
- Madden-Julian Oscillation - tropical weather pattern indices
- Pacific North American Index - large-scale atmospheric circulation pattern
Vegetation & Land:
- MODIS Vegetation Indices - satellite-based vegetation measurements
- Copernicus DEM - 90m resolution digital elevation model
- NLCD Urban Imperviousness - water-resistant surface measurements
- BasinATLAS - comprehensive basin attributes database
3.1 Examples
Let’s look into some of the datasets. It took me a while to create the visualizations when I was working on the competition, so I thought it would be nice to share them here.

Next is a hydrological analysis of a sample basin using the Copernicus Digital Elevation Model (DEM). These visualizations demonstrate how raw elevation data is processed to model water flow paths, which is a key step in understanding watershed behavior for streamflow forecasting.



Citation
@online{sarang2024,
author = {Sarang, Nima},
title = {Water {Supply} {Forecast} {Rodeo}},
date = {2024-10-15},
url = {https://www.nimasarang.com/project/2024-10-15-water-supply-forecast-rodeo/},
langid = {en}
}