Version Control and Provenance

Tracking Workflows and Metadata with CMF

A key part of FDP is helping users track what they did, when they did it, and how it connects to the data. This is handled by the Common Metadata Framework (CMF), which manages version control for data, code, and metadata while also tracking the full history of a workflow.

Local and Public Repos

Each user works in a local CMF repository, which is like a personal Git clone. The repository tracks scripts, workflow configs, datasets, and metadata. Under the hood, CMF uses Git and DVC for version control, with SQLite for fast metadata access.
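
To illustrate why a small SQLite index helps with metadata lookups, here is a minimal sketch using Python's standard sqlite3 module. The table name, columns, file paths, and commit IDs are all hypothetical and are not CMF's actual schema; the point is only that an indexed local table can answer "which artifacts belong to this shot?" without walking the Git or DVC history.

```python
import sqlite3

# Hypothetical schema for illustration only; CMF's real tables are not shown here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (path TEXT, commit_id TEXT, kind TEXT, shot INTEGER)"
)
conn.executemany(
    "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
    [
        ("data/raw_12345.h5", "a1b2c3", "dataset", 12345),
        ("scripts/calibrate.py", "a1b2c3", "script", 12345),
        ("data/raw_12346.h5", "d4e5f6", "dataset", 12346),
    ],
)
# An index on the lookup column keeps queries fast as the table grows.
conn.execute("CREATE INDEX idx_artifacts_shot ON artifacts (shot)")

def artifacts_for_shot(connection, shot):
    """Return the paths recorded for one shot, sorted by path."""
    rows = connection.execute(
        "SELECT path FROM artifacts WHERE shot = ? ORDER BY path", (shot,)
    ).fetchall()
    return [row[0] for row in rows]

print(artifacts_for_shot(conn, 12345))
```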

At the facility level, even raw data can be tracked with DVC and managed in the same way, so every processing step stays linked to the exact inputs it used.

When users are ready to share, they push commits to a public CMF repository, a location that is read-only to other users and acts like a GitHub mirror. These public repos can live on multiple remotes and are what MetaHub indexes for search and download.

This setup makes it easy to work privately, then contribute reproducible, curated outputs to the wider community.

Workflow Graphs (Provenance)

CMF stores each workflow as a directed acyclic graph (DAG). Each node is a step or an artifact, such as a script or a data file, and edges show how they are connected. This makes it easy to retrace your steps, compare runs, or re-run only the parts that changed.

The DAG format also helps with search and metadata extraction, since everything is explicitly linked.
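
As a rough illustration of how a DAG supports re-running only what changed, the sketch below walks downstream from a modified node and collects everything that depends on it. The file names and the affected_by helper are hypothetical and do not come from CMF; this is a plain breadth-first traversal, not CMF's own implementation.

```python
from collections import deque

# Hypothetical workflow: each key is a node, each value lists the nodes that consume it.
edges = {
    "raw_shot.h5": ["calibrate.py"],
    "calibrate.py": ["calibrated.h5"],
    "calibrated.h5": ["fit_profile.py"],
    "fit_config.yaml": ["fit_profile.py"],
    "fit_profile.py": ["profile.csv"],
    "profile.csv": [],
}

def affected_by(changed, graph):
    """Return every node downstream of the changed nodes (breadth-first)."""
    seen = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Editing only the fit configuration leaves the calibration stage untouched.
stale = affected_by({"fit_config.yaml"}, edges)
print(sorted(stale))  # ['fit_profile.py', 'profile.csv']
```

Because the graph is acyclic, the traversal always terminates, and anything outside the returned set can be reused from a previous run.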

Metadata and Interoperability

CMF captures a mix of static info (like shot numbers or diagnostic names) and dynamic info (like parameters, software versions, or environment details). This metadata is versioned and searchable, and can be linked directly to published results.
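
The split between static and dynamic metadata can be sketched as a single record type. The RunMetadata class, its field names, and the example values below are assumptions made for illustration, not CMF's data model; the static fields are supplied by the user, while the dynamic fields are captured from the running environment.

```python
import platform
from dataclasses import asdict, dataclass, field

@dataclass
class RunMetadata:
    """Hypothetical record; not CMF's actual metadata structure."""
    # Static info: fixed facts about the data being processed.
    shot_number: int
    diagnostic: str
    # Dynamic info: captured for each run.
    parameters: dict = field(default_factory=dict)
    python_version: str = field(default_factory=platform.python_version)
    os_name: str = field(default_factory=platform.system)

record = RunMetadata(
    shot_number=12345,
    diagnostic="thomson_scattering",
    parameters={"smoothing": 0.1},
)
# A plain dict is easy to serialize, version, and index for search.
print(asdict(record))
```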

The framework is also designed to map to existing standards, such as IMAS, so the metadata can be reused by other systems and tools.