Subsystems
How FDP is Structured
FDP is built from a set of modular components that handle data ingestion, processing, curation, and publishing. These components are grouped into three main areas: public-facing services, tools for building and running workflows, and infrastructure for accessing facility data.
Public Interfaces
MetaHub is the web portal where users can search, browse, and download datasets, workflows, and metadata. It's designed for people who want to use curated data but aren’t necessarily building workflows themselves.
The CMF Public Repository is a read-only mirror of internal repositories, built using Git and Data Version Control (DVC) and backed by PostgreSQL. It is the source of truth for anything shared publicly, and its versioned history supports reproducibility and provenance tracking.
The Data Sharing Interface connects FDP to external computing sites through the Open Science Data Federation (OSDF). Those sites access FDP-hosted data using the Pelican platform, which locates the closest cached copy of the requested data and serves it from there.
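To make "closest cached copy" concrete, here is a minimal sketch of choosing a cache by geographic distance. It is an illustration only and does not use the Pelican client or describe Pelican's actual selection logic; the cache names, coordinates, and the `nearest_cache` helper are all hypothetical.

```python
import math

# Hypothetical cache sites: (name, latitude, longitude) in degrees.
CACHES = [
    ("cache-west", 37.87, -122.27),
    ("cache-central", 41.88, -87.63),
    ("cache-east", 40.35, -74.65),
]


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_cache(client_lat, client_lon, holders):
    """Return the name of the closest cache that holds the object, or None.

    `holders` is the set of cache names that already have a copy.
    """
    candidates = [c for c in CACHES if c[0] in holders]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda c: great_circle_km(client_lat, client_lon, c[1], c[2]),
    )
    return best[0]


# A client in southern California, with the object cached everywhere.
choice_all = nearest_cache(32.72, -117.16, {"cache-west", "cache-central", "cache-east"})
# The same client when the western cache does not hold the object.
choice_partial = nearest_cache(32.72, -117.16, {"cache-central", "cache-east"})
```

A real federation would likely weigh more than distance (load, availability, network topology), so treat the distance metric here as a stand-in.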
Workflow Development
This part of FDP is where users create and run pipelines that operate on cached experimental data:
- Curation Tools let users tag, label, and annotate datasets, either manually or through automated processes.
- CMF Local Repository is where users store code, metadata, and data artifacts during development. It's based on Git, DVC, and SQLite.
- Facility Data Access provides read-only access to raw data. Tools like TokSearch connect to this interface to pull data into workflows.
- User-Defined Workflows are graphs of processing steps (scripts, model training, transformations) built using platform APIs like TokSearch and the CMF API, alongside familiar libraries like NumPy or PyTorch.
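The idea of a workflow as a graph of processing steps can be sketched with the Python standard library alone. This is not the TokSearch or CMF API; the step names, the `WORKFLOW` layout, and `run_workflow` are hypothetical, and the "facility data" is a hard-coded list standing in for a real read.

```python
from graphlib import TopologicalSorter


def load_shot(_inputs):
    # Stand-in for pulling a signal through the facility data interface.
    return [0.0, 1.0, 4.0, 9.0]


def normalize(inputs):
    # Scale the signal so its peak value is 1.0.
    signal = inputs["load_shot"]
    peak = max(signal)
    return [x / peak for x in signal]


def summarize(inputs):
    # Reduce the normalized signal to a small metadata record.
    values = inputs["normalize"]
    return {"n": len(values), "mean": sum(values) / len(values)}


# Step name -> (function, names of upstream steps it depends on).
WORKFLOW = {
    "load_shot": (load_shot, []),
    "normalize": (normalize, ["load_shot"]),
    "summarize": (summarize, ["normalize"]),
}


def run_workflow(workflow):
    """Run every step after its dependencies and return all step outputs."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return results


results = run_workflow(WORKFLOW)
```

Declaring dependencies explicitly, rather than calling steps in a fixed order, is what lets a runner reorder, skip, or cache steps; a production workflow engine would add that on top of this ordering logic.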
How It All Connects
Workflows write their outputs to the CMF Local Repository, which can later be pushed to the CMF Public Repository. MetaHub reads from the public repository to index and expose the latest datasets and workflows. TokSearch workflows pull directly from the facility cache, which mirrors the experimental data served by fusion facilities.
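The flow from local repository to public repository to portal index can be modeled in a few lines. This sketch is a toy in-memory model, not Git, DVC, or the CMF API; the `Repository` class, `push`, and `build_index` are hypothetical names, and SHA-256 digests stand in for whatever versioning identifiers the real repositories use.

```python
import hashlib


class Repository:
    """Toy content-addressed store: artifact name -> (digest, bytes)."""

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.artifacts = {}

    def write(self, name, data):
        # Direct writes are rejected on a read-only (public) repository.
        if self.read_only:
            raise PermissionError("repository is read-only")
        digest = hashlib.sha256(data).hexdigest()
        self.artifacts[name] = (digest, data)
        return digest


def push(local, public):
    """Mirror new or changed artifacts into the public store; return their names."""
    pushed = []
    for name, (digest, data) in local.artifacts.items():
        if public.artifacts.get(name, (None, None))[0] != digest:
            # The mirror step bypasses write(), which users cannot call directly.
            public.artifacts[name] = (digest, data)
            pushed.append(name)
    return pushed


def build_index(public):
    """What a portal could expose for search: artifact name -> digest."""
    return {name: digest for name, (digest, _) in public.artifacts.items()}


local = Repository()
public = Repository(read_only=True)

local.write("summary.json", b'{"mean": 0.39}')
first_push = push(local, public)    # the artifact is new, so it is mirrored
second_push = push(local, public)   # nothing changed, so nothing is mirrored
index = build_index(public)
```

Comparing digests before copying is what makes a repeated push a no-op, and it gives the index a stable identifier for each published artifact.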