Streamflow Datasets & Specifications¶
DDR is designed to route lateral inflow from a large number of unit-catchments across many timesteps. To accommodate diverse data sources, DDR uses a uniform input specification.
Overview¶
DDR requires three main types of input data:
- Lateral Inflow (Q' or Q_l): Runoff predictions from unit catchments
- Geospatial Fabric: River network topology and channel properties
- Observations: Streamflow measurements for training/validation
Lateral Inflow Specification¶
Data Format¶
Lateral inflow data should be provided as an Icechunk store or zarr array with the following structure:
import xarray as xr

# Expected dimensions and coordinates
ds = xr.Dataset(
    data_vars={
        "Qr": (["time", "divide_id"], qr_data),  # Lateral inflow in m³/s
    },
    coords={
        "time": time_index,       # Daily timestamps
        "divide_id": divide_ids,  # Catchment identifiers
    },
    attrs={
        "units": "m^3/s",
        "source": "your_model_name",
    },
)
Requirements¶
| Property | Requirement |
|---|---|
| Units | Cubic meters per second (m³/s) |
| Temporal resolution | Daily (interpolated to hourly internally) |
| Spatial coverage | All divide_ids in the routing domain |
| Missing values | Fill with small positive value (e.g., 1e-6) |
| Negative values | Not allowed |
Unit Conversion¶
If your data is in mm/day, convert using drainage area:
# Convert mm/day to m³/s
# mm/day * km² * 1000 / 86400 = m³/s
conversion_factor = area_km2 * 1000 / 86400
qr_m3_s = runoff_mm_day * conversion_factor
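As a runnable sketch of the conversion with illustrative numbers (the runoff depths and drainage areas below are made up):

```python
import numpy as np

runoff_mm_day = np.array([2.5, 3.1])  # runoff depth per catchment, mm/day
area_km2 = np.array([120.0, 85.0])    # drainage area per catchment, km²

# 1 mm of runoff over 1 km² is 1000 m³; dividing by 86400 s/day yields m³/s
conversion_factor = area_km2 * 1000 / 86400
qr_m3_s = runoff_mm_day * conversion_factor
```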
Supported Data Sources¶
DDR provides download scripts for several pre-computed lateral inflow products:
| Source | Coverage | Period | Location |
|---|---|---|---|
| dHBV2.0 (Hydrofabric v2.2) | CONUS | 1980-2020 | s3://mhpi-spatial/hydrofabric_v2.2_dhbv_retrospective |
| dHBV2.0 (MERIT) | CONUS | 1980-2020 | Zenodo |
Geospatial Data Requirements¶
Hydrofabric v2.2¶
The NOAA-OWP Hydrofabric v2.2 is the recommended geospatial dataset for CONUS applications.
Required Layers:
| Layer | Description |
|---|---|
| `flowpaths` | River reaches with topology (id, toid) |
| `flowpath-attributes-ml` | Channel properties (length, slope, width) |
| `network` | Network connectivity including gauge locations |
Required Attributes:
# Flowpath attributes
flowpath_attrs = [
    "id",        # Waterbody identifier (wb-XXXXX)
    "toid",      # Downstream identifier
    "Length_m",  # Channel length in meters
    "So",        # Channel slope (dimensionless)
    "TopWdth",   # Top width in meters
    "ChSlp",     # Channel side slope
    "MusX",      # Muskingum X parameter
]
MERIT Hydro¶
MERIT Hydro provides global coverage with variable resolution.
Required Attributes:
| Attribute | Description |
|---|---|
| `COMID` | Unique catchment identifier |
| `NextDownID` | Downstream COMID |
| `up1`-`up4` | Upstream COMID connections |
| `lengthkm` | Channel length in kilometers |
| `slope` | Channel slope |
| `unitarea` | Catchment area in km² |
Connectivity Format¶
DDR uses sparse COO (Coordinate) matrices to represent river network connectivity:
# Adjacency matrix structure
# Rows: downstream segments
# Columns: upstream segments
# Values: 1 (connected) or 0 (not connected)
# Lower triangular: ensures topological ordering
# A[i, j] = 1 means flow goes from segment j to segment i
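As an illustrative sketch of this structure (using `scipy.sparse` here, which is an assumption for illustration only; DDR itself stores the matrices in zarr as described below):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy 4-segment network in topological order:
# segments 0 and 1 drain into segment 2, and segment 2 drains into segment 3
rows = np.array([2, 2, 3])  # downstream segment index i
cols = np.array([0, 1, 2])  # upstream segment index j
vals = np.ones_like(rows, dtype=np.int8)

A = coo_matrix((vals, (rows, cols)), shape=(4, 4))

# With segments numbered in topological order, every nonzero entry sits
# below the diagonal, so the matrix is strictly lower triangular
dense = A.toarray()
```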
The matrices are stored in zarr v3 format following the Binsparse COO specification. See the binsparse documentation for details on:
- Reading and writing adjacency matrices programmatically
- Order converters for MERIT and Lynker ID formats
- Per-gauge subset structure
The engine scripts automatically create these matrices:
# Creates:
# - hydrofabric_v2.2_conus_adjacency.zarr (full network)
# - hydrofabric_v2.2_gages_conus_adjacency.zarr (per-gauge subsets)
uv run python engine/scripts/build_hydrofabric_v2.2_matrices.py \
    /path/to/conus_nextgen.gpkg data/
Catchment Attributes¶
The neural network requires catchment attributes to predict routing parameters.
Default Attribute Set (v0.5.2)¶
The current architecture uses 10 catchment attributes:
input_var_names:
- SoilGrids1km_clay # Clay content (%)
- aridity # Aridity index
- meanelevation # Mean elevation (m)
- meanP # Mean Annual Precipitation (mm)
- NDVI # Normalized Difference Vegetation Index
- meanslope # Mean slope (m/km)
- log_uparea # Natural log of upstream area (km²)
- SoilGrids1km_sand # Sand content (%)
- ETPOT_Hargr # Potential evapotranspiration (Hargreaves)
- Porosity # Soil porosity
These attributes can either be computed yourself or downloaded from s3://mhpi-spatial/hydrofabric_v2.2_attributes/ for the Lynker Hydrofabric. Catchment attributes are specific to each geodataset.
Note
MERIT uses log10_uparea while Lynker Hydrofabric uses log_uparea. These are different transformations of upstream drainage area.
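The two transformations differ only by the constant factor ln(10), as a quick check shows (the upstream area value here is illustrative):

```python
import numpy as np

uparea_km2 = 1000.0
log_uparea = np.log(uparea_km2)      # natural log, Lynker Hydrofabric
log10_uparea = np.log10(uparea_km2)  # base-10 log, MERIT

# The two differ by a factor of ln(10) ≈ 2.3026
ratio = log_uparea / log10_uparea
```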
Attribute Storage¶
Attributes should be stored in an Icechunk/zarr store where possible; NetCDF is supported for the MERIT dataset:
# Hydrofabric v2.2 format
ds = xr.Dataset(
    data_vars={
        "aridity": (["divide_id"], aridity_values),
        "elev_mean": (["divide_id"], elev_values),
        # ... other attributes
    },
    coords={
        "divide_id": divide_ids,  # Format: "cat-XXXXX"
    },
)
Normalization¶
DDR automatically computes and caches normalization statistics:
# Statistics stored per attribute
{
    "min": min_value,
    "max": max_value,
    "mean": mean_value,
    "std": std_value,
    "p10": p10_value,  # 10th percentile
    "p90": p90_value,  # 90th percentile
}
Statistics are cached to data/statistics/{geodataset}_attribute_statistics_{store_name}.json.
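A sketch of how such per-attribute statistics can be computed with NumPy and serialized like the cache file (the attribute values below are made up; the field names follow the format above):

```python
import json
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

stats = {
    "min": float(values.min()),
    "max": float(values.max()),
    "mean": float(values.mean()),
    "std": float(values.std()),
    "p10": float(np.percentile(values, 10)),
    "p90": float(np.percentile(values, 90)),
}

# Serialize the same way a JSON statistics cache would
cached = json.dumps(stats)
```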
Observations¶
USGS Streamflow Data¶
DDR uses USGS streamflow observations for training and validation.
Data Format:
ds = xr.Dataset(
    data_vars={
        "streamflow": (["time", "gage_id"], flow_values),  # m³/s
    },
    coords={
        "time": time_index,
        "gage_id": gage_ids,  # 8-digit zero-padded strings
    },
)
Pre-formatted Observations:
DDR provides access to pre-processed USGS observations via S3 (see Data Sources below).
Gauge Information CSV¶
Training requires a gauge information file:
STAID,STANAME,DRAIN_SQKM,LAT_GAGE,LNG_GAGE
01563500,Susquehanna River at Harrisburg,62419,40.2548,-76.8867
01570500,Susquehanna River at Sunbury,46706,40.8576,-76.7944
Required Columns:
| Column | Description | Format |
|---|---|---|
| `STAID` | Station ID | 8-digit, zero-padded |
| `DRAIN_SQKM` | Drainage area | km² (positive float) |
| `LAT_GAGE` | Latitude | Decimal degrees |
| `LNG_GAGE` | Longitude | Decimal degrees |
Optional Columns:
| Column | Description |
|---|---|
| `STANAME` | Station name |
| `COMID` | MERIT catchment ID (required for MERIT dataset) |
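When loading the gauge CSV with pandas (an assumption here; DDR may read it differently), force STAID to a string dtype so the zero padding survives:

```python
import io

import pandas as pd

csv_text = """STAID,STANAME,DRAIN_SQKM,LAT_GAGE,LNG_GAGE
01563500,Susquehanna River at Harrisburg,62419,40.2548,-76.8867
"""

# dtype={"STAID": str} keeps the 8-digit zero padding; a plain read_csv
# would parse STAID as an integer and drop the leading zero
gages = pd.read_csv(io.StringIO(csv_text), dtype={"STAID": str})
```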
Pre-prepared gauge lists are available in references/gage_info/:
| File | Gages | Source |
|---|---|---|
| `camels_670.csv` | 670 | CAMELS / HCDN-2009 (Newman et al., 2014) |
| `gages_3000.csv` | 3211 | Ouyang et al., 2021 |
| `GAGES-II.csv` | 8945 | GAGES-II (Falcone, 2011) |
See references/gage_info/README.md for derivation details.
Data Sources¶
Pre-computed S3 Data¶
DDR provides access to pre-computed datasets on AWS S3:
| Dataset | S3 Path | Description |
|---|---|---|
| HF v2.2 Attributes | s3://mhpi-spatial/hydrofabric_v2.2_attributes/ | Catchment attributes |
| HF v2.2 Streamflow | s3://mhpi-spatial/hydrofabric_v2.2_dhbv_retrospective | dHBV2.0 predictions |
| USGS Observations | s3://mhpi-spatial/usgs_streamflow_observations/ | Historical streamflow |
Access is anonymous (no AWS credentials required):
from ddr.io.readers import read_ic
# Read from S3
ds = read_ic("s3://mhpi-spatial/hydrofabric_v2.2_attributes/", region="us-east-2")
Local Data¶
For local data, use a local file path in place of an S3 URL.
Preparing Custom Data¶
Creating Lateral Inflow Data¶
If you have your own runoff model, format the output for DDR:
import icechunk as ic
import numpy as np
import pandas as pd
import xarray as xr
from icechunk.xarray import to_icechunk

# Load your model output
qr = load_your_model_output()  # shape: (n_catchments, n_timesteps)

# Create xarray Dataset
ds = xr.Dataset(
    data_vars={
        "Qr": (["divide_id", "time"], qr.astype(np.float32)),
    },
    coords={
        "divide_id": your_divide_ids,
        "time": pd.date_range("1980-01-01", periods=n_timesteps, freq="D"),
    },
    attrs={"units": "m^3/s"},
)

# Save to Icechunk
storage = ic.local_filesystem_storage("./my_streamflow_data")
repo = ic.Repository.create(storage)
session = repo.writable_session("main")
to_icechunk(ds, session)
session.commit("Initial commit")
External Resources¶
- Gauge Lists: DeepGroundwater/datasets
- MERIT Hydro: University of Tokyo