Streamflow Datasets & Specifications¶
DDR is designed to route lateral inflow from a large number of unit-catchments across many timesteps. To accommodate diverse data sources, DDR uses a uniform input specification.
Overview¶
DDR requires three main types of input data:
- Lateral Inflow (Q' or Q_l): Runoff predictions from unit catchments
- Geospatial Fabric: River network topology and channel properties
- Observations: Streamflow measurements for training/validation
Lateral Inflow Specification¶
Data Format¶
Lateral inflow data should be provided as an Icechunk store or zarr array with the following structure:
import xarray as xr

# Expected dimensions and coordinates
ds = xr.Dataset(
    data_vars={
        "Qr": (["time", "divide_id"], qr_data),  # Lateral inflow in m³/s
    },
    coords={
        "time": time_index,       # Daily timestamps
        "divide_id": divide_ids,  # Catchment identifiers
    },
    attrs={
        "units": "m^3/s",
        "source": "your_model_name",
    },
)
Requirements¶
| Property | Requirement |
|---|---|
| Units | Cubic meters per second (m³/s) |
| Temporal resolution | Daily (interpolated to hourly internally) |
| Spatial coverage | All divide_ids in the routing domain |
| Missing values | Fill with small positive value (e.g., 1e-6) |
| Negative values | Not allowed |
Unit Conversion¶
If your data is in mm/day, convert using drainage area:
# Convert mm/day to m³/s
# mm/day * km² * 1000 / 86400 = m³/s
conversion_factor = area_km2 * 1000 / 86400
qr_m3_s = runoff_mm_day * conversion_factor
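As a runnable sketch of the conversion with illustrative numbers (the runoff depths and drainage areas below are made up):

```python
import numpy as np

runoff_mm_day = np.array([2.5, 3.1])  # runoff depth per catchment, mm/day
area_km2 = np.array([120.0, 85.0])    # drainage area per catchment, km²

# 1 mm of runoff over 1 km² is 1000 m³; dividing by 86400 s/day yields m³/s
conversion_factor = area_km2 * 1000 / 86400
qr_m3_s = runoff_mm_day * conversion_factor
```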
Supported Data Sources¶
DDR provides download scripts for several pre-computed lateral inflow products:
| Source | Coverage | Period | Location |
|---|---|---|---|
| dHBV2.0 (Hydrofabric v2.2) | CONUS | 1980-2020 | s3://mhpi-spatial/hydrofabric_v2.2_dhbv_retrospective |
| dHBV2.0 (MERIT) | CONUS | 1980-2020 | Zenodo |
Geospatial Data Requirements¶
Hydrofabric v2.2¶
The NOAA-OWP Hydrofabric v2.2 is the recommended geospatial dataset for CONUS applications.
Required Layers:
| Layer | Description |
|---|---|
| `flowpaths` | River reaches with topology (id, toid) |
| `flowpath-attributes-ml` | Channel properties (length, slope, width) |
| `network` | Network connectivity including gauge locations |
Required Attributes:
# Flowpath attributes
flowpath_attrs = [
    "id",        # Waterbody identifier (wb-XXXXX)
    "toid",      # Downstream identifier
    "Length_m",  # Channel length in meters
    "So",        # Channel slope (dimensionless)
    "TopWdth",   # Top width in meters
    "ChSlp",     # Channel side slope
    "MusX",      # Muskingum X parameter
]
MERIT Hydro¶
MERIT Hydro provides global coverage with variable resolution.
Required Attributes:
| Attribute | Description |
|---|---|
| `COMID` | Unique catchment identifier |
| `NextDownID` | Downstream COMID |
| `up1`-`up4` | Upstream COMID connections |
| `lengthkm` | Channel length in kilometers |
| `slope` | Channel slope |
| `unitarea` | Catchment area in km² |
Connectivity Format¶
DDR uses sparse COO (Coordinate) matrices to represent river network connectivity:
# Adjacency matrix structure
# Rows: downstream segments
# Columns: upstream segments
# Values: 1 (connected) or 0 (not connected)
# Lower triangular: ensures topological ordering
# A[i, j] = 1 means flow goes from segment j to segment i
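As an illustrative sketch of this structure (using `scipy.sparse` here, which is an assumption for illustration only; DDR itself stores the matrices in zarr as described below):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy 4-segment network in topological order:
# segments 0 and 1 drain into segment 2, and segment 2 drains into segment 3
rows = np.array([2, 2, 3])  # downstream segment index i
cols = np.array([0, 1, 2])  # upstream segment index j
vals = np.ones_like(rows, dtype=np.int8)

A = coo_matrix((vals, (rows, cols)), shape=(4, 4))

# With segments numbered in topological order, every nonzero entry sits
# below the diagonal, so the matrix is strictly lower triangular
dense = A.toarray()
```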
The matrices are stored in zarr v3 format following the Binsparse COO specification. See the binsparse documentation for details on:
- Reading and writing adjacency matrices programmatically
- Order converters for MERIT and Lynker ID formats
- Per-gauge subset structure
The engine scripts automatically create these matrices:
# Creates:
# - hydrofabric_v2.2_conus_adjacency.zarr (full network)
# - hydrofabric_v2.2_gages_conus_adjacency.zarr (per-gauge subsets)
uv run python engine/scripts/build_hydrofabric_v2.2_matrices.py \
    /path/to/conus_nextgen.gpkg data/
Catchment Attributes¶
The neural network requires catchment attributes to predict routing parameters.
Default Attribute Set (v0.5.2)¶
The current architecture uses 10 catchment attributes:
input_var_names:
- SoilGrids1km_clay # Clay content (%)
- aridity # Aridity index
- meanelevation # Mean elevation (m)
- meanP # Mean Annual Precipitation (mm)
- NDVI # Normalized Difference Vegetation Index
- meanslope # Mean slope (m/km)
- log_uparea # Natural log of upstream area (km²)
- SoilGrids1km_sand # Sand content (%)
- ETPOT_Hargr # Potential evapotranspiration (Hargreaves)
- Porosity # Soil porosity
These attributes can either be computed yourself or downloaded from s3://mhpi-spatial/hydrofabric_v2.2_attributes/ for the Lynker Hydrofabric. Catchment attributes are specific to each geodataset.
Note
MERIT uses log10_uparea while Lynker Hydrofabric uses log_uparea. These are different transformations of upstream drainage area.
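The two transformations differ only by the constant factor ln(10), as a quick check shows (the upstream area value here is illustrative):

```python
import numpy as np

uparea_km2 = 1000.0
log_uparea = np.log(uparea_km2)      # natural log, Lynker Hydrofabric
log10_uparea = np.log10(uparea_km2)  # base-10 log, MERIT

# The two differ by a factor of ln(10) ≈ 2.3026
ratio = log_uparea / log10_uparea
```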
Attribute Storage¶
Attributes should be stored in an Icechunk/zarr store where possible; NetCDF is supported for the MERIT dataset:
# Hydrofabric v2.2 format
ds = xr.Dataset(
    data_vars={
        "aridity": (["divide_id"], aridity_values),
        "elev_mean": (["divide_id"], elev_values),
        # ... other attributes
    },
    coords={
        "divide_id": divide_ids,  # Format: "cat-XXXXX"
    },
)
Normalization¶
DDR automatically computes and caches normalization statistics:
# Statistics stored per attribute
{
    "min": min_value,
    "max": max_value,
    "mean": mean_value,
    "std": std_value,
    "p10": p10_value,  # 10th percentile
    "p90": p90_value,  # 90th percentile
}
Statistics are cached to data/statistics/{geodataset}_attribute_statistics_{store_name}.json.
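A sketch of how such per-attribute statistics can be computed with NumPy and serialized like the cache file (the attribute values below are made up; the field names follow the format above):

```python
import json
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

stats = {
    "min": float(values.min()),
    "max": float(values.max()),
    "mean": float(values.mean()),
    "std": float(values.std()),
    "p10": float(np.percentile(values, 10)),
    "p90": float(np.percentile(values, 90)),
}

# Serialize the same way a JSON statistics cache would
cached = json.dumps(stats)
```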
Observations¶
USGS Streamflow Data¶
DDR uses USGS streamflow observations for training and validation.
Data Format:
ds = xr.Dataset(
    data_vars={
        "streamflow": (["time", "gage_id"], flow_values),  # m³/s
    },
    coords={
        "time": time_index,
        "gage_id": gage_ids,  # 8-digit zero-padded strings
    },
)
Pre-formatted Observations:
DDR provides access to pre-processed USGS observations via S3 (see Data Sources below).
Gauge Information CSV¶
Training requires a gauge information file:
STAID,STANAME,DRAIN_SQKM,LAT_GAGE,LNG_GAGE
01563500,Susquehanna River at Harrisburg,62419,40.2548,-76.8867
01570500,Susquehanna River at Sunbury,46706,40.8576,-76.7944
Required Columns:
| Column | Description | Format |
|---|---|---|
| `STAID` | Station ID | 8-digit, zero-padded |
| `DRAIN_SQKM` | Drainage area | km² (positive float) |
| `LAT_GAGE` | Latitude | Decimal degrees |
| `LNG_GAGE` | Longitude | Decimal degrees |
Optional Columns:
| Column | Description |
|---|---|
| `STANAME` | Station name |
| `COMID` | MERIT catchment ID (required for MERIT dataset) |
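When loading the gauge CSV with pandas (an assumption here; DDR may read it differently), force STAID to a string dtype so the zero padding survives:

```python
import io

import pandas as pd

csv_text = """STAID,STANAME,DRAIN_SQKM,LAT_GAGE,LNG_GAGE
01563500,Susquehanna River at Harrisburg,62419,40.2548,-76.8867
"""

# dtype={"STAID": str} keeps the 8-digit zero padding; a plain read_csv
# would parse STAID as an integer and drop the leading zero
gages = pd.read_csv(io.StringIO(csv_text), dtype={"STAID": str})
```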
Pre-prepared gauge lists are available in references/gage_info/:
| File | Gages | Source |
|---|---|---|
| `camels_670.csv` | 670 | CAMELS / HCDN-2009 (Newman et al., 2014) |
| `gages_3000.csv` | 3211 | Ouyang et al., 2021 |
| `GAGES-II.csv` | 8945 | GAGES-II (Falcone, 2011) |
See references/gage_info/README.md for derivation details.
Data Sources¶
Pre-computed S3 Data¶
DDR provides access to pre-computed datasets on AWS S3:
| Dataset | S3 Path | Description |
|---|---|---|
| HF v2.2 Attributes | s3://mhpi-spatial/hydrofabric_v2.2_attributes/ | Catchment attributes |
| HF v2.2 Streamflow | s3://mhpi-spatial/hydrofabric_v2.2_dhbv_retrospective | dHBV2.0 predictions |
| USGS Observations | s3://mhpi-spatial/usgs_streamflow_observations/ | Historical streamflow |
Access is anonymous (no AWS credentials required):
from ddr.io.readers import read_ic
# Read from S3
ds = read_ic("s3://mhpi-spatial/hydrofabric_v2.2_attributes/", region="us-east-2")
Local Data¶
For local data, use a local file path in place of an S3 URL.
Preparing Custom Data¶
Creating Lateral Inflow Data¶
If you have your own runoff model, format the output for DDR:
import icechunk as ic
import numpy as np
import pandas as pd
import xarray as xr
from icechunk.xarray import to_icechunk

# Load your model output
qr = load_your_model_output()  # shape: (n_catchments, n_timesteps)

# Create xarray Dataset
ds = xr.Dataset(
    data_vars={
        "Qr": (["divide_id", "time"], qr.astype(np.float32)),
    },
    coords={
        "divide_id": your_divide_ids,
        "time": pd.date_range("1980-01-01", periods=n_timesteps, freq="D"),
    },
    attrs={"units": "m^3/s"},
)

# Save to Icechunk
storage = ic.local_filesystem_storage("./my_streamflow_data")
repo = ic.Repository.create(storage)
session = repo.writable_session("main")
to_icechunk(ds, session)
session.commit("Initial commit")
External Resources¶
- Gauge Lists: DeepGroundwater/datasets
- MERIT Hydro: University of Tokyo