Architecture

The full Constellation Research Stack architecture is captured in the canonical Notion doc:

This page mirrors the Virgo-specific section so the deployed docs site can stand on its own.

Goals

  1. Composable transforms. Each transform is a Pydantic class — the class itself is the transform, its fields are the config, the class name is the transform name. Outputs of one feed inputs of the next.

  2. Three-axis cache. Input data identity (incl. time slice) × code version × config hash.

  3. Temporally granular cache. “Reprocess windows [t₁,t₂] with config A; load the rest from a cached config B run” is a single API call.

  4. Per-transform compute. Each transform declares CPU/RAM/GPU requirements; the DAG executor dispatches each node to the right hardware.

  5. Standard pipeline. A versioned default pipeline runs automatically on every newly-ingested recording.

  6. Local + cloud execution parity. Same pipeline runs on Polaris (Slurm + Ray) or in the cloud (SkyPilot + Ray).

  7. Observability. Every run emits structured events: which assets ran, which were cache hits, how long, peak memory, $.
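Goal 3 can be made concrete with a small scheduling sketch. The helper below is hypothetical (not part of the Virgo API): it partitions a requested time range into segments that must be recomputed under the new config versus segments that can be served from a cached run of the old config.

```python
def plan_segments(total_start, total_end, reprocess_windows):
    """Split [total_start, total_end] into ('recompute'|'cache', start, end) segments.

    reprocess_windows: non-overlapping (t1, t2) tuples to re-run with the new config;
    everything outside them is served from cache.
    """
    segments = []
    cursor = total_start
    for t1, t2 in sorted(reprocess_windows):
        # Clip each window to the requested range and skip empty intersections.
        t1, t2 = max(t1, total_start), min(t2, total_end)
        if t1 >= t2:
            continue
        if cursor < t1:
            segments.append(("cache", cursor, t1))
        segments.append(("recompute", t1, t2))
        cursor = t2
    if cursor < total_end:
        segments.append(("cache", cursor, total_end))
    return segments
```

The single-API-call framing in goal 3 amounts to running this partition and dispatching each segment to either the executor or the cache lookup.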

Stack choices

  • Orchestrator: Dagster (Software-Defined Assets — code declares the desired state of derived datasets)

  • Versioning: Metaxy layered on Dagster for sample-level + field-level incremental versioning

  • Executor: Ray for parallel asset materialization; SkyPilot wraps Ray for cloud bursts; Slurm on Polaris (SkyPilot has first-class Slurm support)

  • Storage of derived assets: Ursa (the derived_assets table). Virgo never owns its own storage.

Transforms are Pydantic classes

from typing import ClassVar
from virgo import Transform, Resources, CacheKey, Local

class EEGBandpass(Transform):
    """Apply a Butterworth bandpass filter to an EEG signal."""

    # Pydantic config fields (per-instance)
    low_hz: float = 1.0
    high_hz: float = 40.0
    order: int = 4

    # ClassVar metadata (static across instances)
    version: ClassVar[str] = "0.4.1"
    compute: ClassVar[Resources] = Resources(cpu=4, ram_gb=8)
    cache:   ClassVar[CacheKey]  = CacheKey(time_locality=Local(window=2.0))

    def run(self, eeg):
        ...

Identity in the cache key: f"{cls.__module__}.{cls.__qualname__}@{version}", where cls is the transform class. Pipelines compose transform instances:

from virgo import Pipeline
from virgo.transforms.eeg import EEGBandpass
from virgo.transforms.video import VideoFaceLandmarks

pipeline = Pipeline([
    EEGBandpass(low_hz=1.0, high_hz=40.0),
    VideoFaceLandmarks(min_confidence=0.8),
])
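As a hedged illustration of the identity scheme, using a stand-in class rather than the real Transform base (the module prefix depends on where the class is defined):

```python
class DummyTransform:
    """Stand-in for a Virgo transform; only the version ClassVar matters here."""
    version = "0.4.1"

def identity(transform) -> str:
    """Build the versioned identity string used as one component of the cache key."""
    cls = type(transform)
    return f"{cls.__module__}.{cls.__qualname__}@{cls.version}"
```

Because the string embeds both the fully qualified class name and the version, two releases of the same transform never collide in the cache.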

Cache key

asset_key = sha256(
    transform.identity +             # f"{module}.{class}@{version}"
    inputs.content_hash +            # recursively, the keys of upstream assets
    config.canonical_hash +          # Pydantic dump_json with sort_keys
    code.module_hash                 # SHA of the transform's source at the resolved version
)
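A minimal sketch of the config-hash component, using a plain dict in place of Pydantic's serialized model (the real pipeline hashes the model's JSON dump; json.dumps with sort_keys gives the same canonicalization property):

```python
import hashlib
import json

def canonical_hash(config: dict) -> str:
    """Hash a config deterministically: key order must not change the digest."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()
```

Sorting keys and fixing separators means semantically identical configs always hash identically, so a reordered config file never causes a spurious cache miss.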

Cache hits resolve via Ursa’s derived_assets table. Misses enqueue an asset materialization. Metaxy adds field-level invalidation so a code change that affects only one output field re-runs only that field’s path.

Standard pipeline

A canonical, versioned Pipeline lives in virgo/standard.py and exposes its identifier as the constant STANDARD_V2026Q2. When Ursa ingests a new recording, it triggers a Virgo job for the current standard pipeline; output assets land in derived_assets with pipeline_version="standard_v2026q2", and researchers filter on that value to get the curated view.
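The curated-view query reduces to a filter on pipeline_version, sketched here over in-memory rows (the real table lives in Ursa; the asset_key column name is an assumption):

```python
def curated_view(rows, pipeline_version="standard_v2026q2"):
    """Filter derived_assets rows down to the current standard pipeline's outputs."""
    return [r for r in rows if r["pipeline_version"] == pipeline_version]
```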

When the standard version bumps, a hybrid eager/lazy backfill runs: eager for any recording whose last_accessed_at is in the last 60 days; lazy on-demand for everything else.

Module version routing

A config that names an old version of a transform actually executes against that historical source code. Each versioned class lives in a git-tracked file; a version_index.json maps f"{module}.{class}@{version}" to a (git_ref, lockfile_ref) tuple. The orchestrator picks in-process or subprocess execution based on dep compatibility.
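A sketch of the version_index.json lookup. The key format and the (git_ref, lockfile_ref) tuple come from the text; the concrete entry values and the inline-JSON loading are illustrative assumptions:

```python
import json

# Illustrative index contents; in practice this is read from version_index.json.
INDEX_JSON = """
{
  "virgo.transforms.eeg.EEGBandpass@0.4.1": {
    "git_ref": "a1b2c3d",
    "lockfile_ref": "locks/eeg-0.4.1.lock"
  }
}
"""

def resolve(identity: str) -> tuple[str, str]:
    """Map a versioned identity to the (git_ref, lockfile_ref) pair pinning its code."""
    entry = json.loads(INDEX_JSON)[identity]
    return entry["git_ref"], entry["lockfile_ref"]
```

The git_ref fixes the transform's source and the lockfile_ref fixes its dependency set, which is what lets the orchestrator decide between in-process and subprocess execution.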

Phasing

See Linear for issue-level detail.