Architecture¶
The full Constellation Research Stack architecture is captured in the canonical Notion doc:
This page mirrors the Virgo-specific section so the deployed docs site can stand on its own.
Goals¶
Composable transforms. Each transform is a Pydantic class — the class itself is the transform, its fields are the config, the class name is the transform name. Outputs of one feed inputs of the next.
Three-axis cache. Input data identity (incl. time slice) × code version × config hash.
Temporally granular cache. “Reprocess windows [t₁,t₂] with config A; load the rest from a cached config B run” is a single API call.
Per-transform compute. Each transform declares CPU/RAM/GPU requirements; the DAG executor dispatches each node to the right hardware.
Standard pipeline. A versioned default pipeline runs automatically on every newly-ingested recording.
Local + cloud execution parity. Same pipeline runs on Polaris (Slurm + Ray) or in the cloud (SkyPilot + Ray).
Observability. Every run emits structured events: which assets ran, which were cache hits, how long, peak memory, $.
Stack choices¶
Orchestrator: Dagster (Software-Defined Assets — code declares the desired state of derived datasets)
Versioning: Metaxy layered on Dagster for sample-level + field-level incremental versioning
Executor: Ray for parallel asset materialization; SkyPilot wraps Ray for cloud bursts; Slurm on Polaris (SkyPilot has first-class Slurm support)
Storage of derived assets: Ursa (the
derived_assetstable). Virgo never owns its own storage.
Transforms are Pydantic classes¶
from typing import ClassVar
from virgo import Transform, Resources, CacheKey, Local
class EEGBandpass(Transform):
"""Apply a Butterworth bandpass filter to an EEG signal."""
# Pydantic config fields (per-instance)
low_hz: float = 1.0
high_hz: float = 40.0
order: int = 4
# ClassVar metadata (static across instances)
version: ClassVar[str] = "0.4.1"
compute: ClassVar[Resources] = Resources(cpu=4, ram_gb=8)
cache: ClassVar[CacheKey] = CacheKey(time_locality=Local(window=2.0))
def run(self, eeg):
...
Identity in the cache key: f"{class.__module__}.{class.__qualname__}@{version}". Pipelines compose transform instances:
from virgo import Pipeline
from virgo.transforms.eeg import EEGBandpass
from virgo.transforms.video import VideoFaceLandmarks
pipeline = Pipeline([
EEGBandpass(low_hz=1.0, high_hz=40.0),
VideoFaceLandmarks(min_confidence=0.8),
])
Cache key¶
asset_key = sha256(
transform.identity + # f"{module}.{class}@{version}"
inputs.content_hash + # recursively, the keys of upstream assets
config.canonical_hash + # Pydantic dump_json with sort_keys
code.module_hash # SHA of the transform's source at the resolved version
)
Cache hits resolve via Ursa’s derived_assets table. Misses enqueue an asset materialization. Metaxy adds field-level invalidation so a code change that affects only one output field re-runs only that field’s path.
Standard pipeline¶
A canonical, versioned Pipeline lives in virgo/standard.py and exposes its identifier as the constant STANDARD_V2026Q2. When Ursa ingests a new recording, it triggers a Virgo job for the current standard pipeline; output assets land in derived_assets with pipeline_version="standard_v2026q2". Researchers query pipeline_version="standard_v2026q2" to get the curated view.
When the standard version bumps, a hybrid eager/lazy backfill runs: eager for any recording whose last_accessed_at is in the last 60 days; lazy on-demand for everything else.
Module version routing¶
A config that names an old version of a transform actually executes against that historical source code. Each versioned class lives in a git-tracked file; a version_index.json maps f"{module}.{class}@{version}" to a (git_ref, lockfile_ref) tuple. The orchestrator picks in-process or subprocess execution based on dep compatibility.
Phasing¶
See Linear for issue-level detail.