This repo is used to train large, state-of-the-art graph neural networks from scratch on datasets such as OC20, OMol25, and OMat24, among others.
FAIRChem Training Framework Overview¶
The FAIRChem training framework currently uses a simple SPMD (Single Program, Multiple Data) paradigm. It consists of the following components:
User CLI and Launcher - The fairchem CLI can run jobs locally using torch distributed elastic or on SLURM. More environments may be supported in the future.
Configuration - We strictly use Hydra YAMLs for configuration.
Runner Interface - The core program code that is replicated to run on all ranks. An optional Reducer is also available for evaluation jobs. Runners are distinct user functions that run on a single rank (i.e., GPU). They describe separate high-level tasks such as Train, Eval, Predict, Relaxations, MD, etc. Anyone can write a new runner if its functionality is sufficiently different from the ones that already exist (see the sketch after this list).
Trainer - We use TorchTNT as a lightweight training loop. This allows us to cleanly separate data loading from the training loop.
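As an illustrative sketch (not code from the repo), a new runner subclasses the Runner base class and implements its run() method; the exact abstract interface (for example, any checkpoint save/load hooks) should be checked in fairchem.core.components.runner, and the x/y arguments here are placeholders that mirror the config examples later on this page.

from fairchem.core.components.runner import Runner

class MyRunner(Runner):
    """Hypothetical runner; every rank executes run() independently (SPMD)."""

    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

    def run(self):
        # the single-rank work goes here (a training loop, an eval pass, MD, ...)
        return self.x + self.y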
FAIRChem v2 CLI¶
FAIRChem uses a single CLI for running jobs. It accepts a single argument, the location of the Hydra YAML.
The CLI can launch jobs locally using torch distributed elastic OR on SLURM.
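For example, assuming a config saved at a hypothetical path:

fairchem -c /path/to/my_config.yaml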
FAIRChem v2 Config Structure¶
A FAIRChem config is composed of only two valid top-level keys: job (Job Config) and runner (Runner Config). Additionally, you can add top-level key/value pairs that are referenced elsewhere in the config via OmegaConf interpolation syntax. No other top-level keys are permitted.
JobConfig represents configuration parameters that describe the overall job (mostly infrastructure parameters), such as the number of nodes, log locations, and loggers. This is a structured config and must strictly adhere to the JobConfig class.
Runner Config describes the user code. This part of the config is recursively instantiated at the start of a job using the Hydra instantiation framework.
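As a rough illustration of the interpolation point above (all keys and values here are made up), an extra top-level key such as cluster can be referenced from inside the job config and is resolved by OmegaConf when the value is accessed:

from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "cluster": {"account": "my_account"},  # extra key used only as an interpolation source
        "job": {"scheduler": {"slurm": {"account": "${cluster.account}"}}},
    }
)
print(cfg.job.scheduler.slurm.account)  # -> "my_account"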
Example Configurations¶
Local run:
job:
  device_type: CUDA
  scheduler:
    mode: LOCAL
    ranks_per_node: 4
  run_name: local_training_run

SLURM run:
job:
  device_type: CUDA
  scheduler:
    mode: SLURM
    ranks_per_node: 8
    num_nodes: 4
    slurm:
      account: ${cluster.account}
      qos: ${cluster.qos}
      mem_gb: ${cluster.mem_gb}
      cpus_per_task: ${cluster.cpus_per_task}
  run_dir: /path/to/output
  run_name: slurm_run_example

Config Object Instantiation¶
To keep our configs explicit (configs should be thought of as an extension of code), we prefer to use the Hydra instantiation framework throughout; the config is always fully described by a corresponding Python class and should never be a standalone dictionary.
Good vs Bad Config Patterns
Bad pattern - We have no idea where to find the code that uses runner or where variables x and y are actually used:
runner:
  x: 5
  y: 6

Good pattern - Now we know which class runner corresponds to and that x, y are just initializer variables of runner. If we need to check the definition or understand the code, we can simply go to runner.py:
runner:
  _target_: fairchem.core.components.runner.Runner
  x: 5
  y: 6
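To make the correspondence concrete, the following sketch (not code from the repo; in practice this happens inside the launcher) shows roughly what Hydra's instantiate does with the block above:

from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "runner": {
            "_target_": "fairchem.core.components.runner.Runner",
            "x": 5,
            "y": 6,
        }
    }
)
runner = instantiate(cfg.runner)  # roughly equivalent to Runner(x=5, y=6)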
Runtime Instantiation with Partial Functions¶
While we want to use static instantiation as much as possible, there are many cases where certain objects require runtime inputs to be created. For example, if we want to create a PyTorch optimizer, we can give it all of its arguments except the model parameters (because they are only known at runtime).
optimizer:
  _target_: torch.optim.AdamW
  params: ?? # this is only known at runtime
  lr: 8e-4
  weight_decay: 1e-3

Instead, we set Hydra's _partial_ flag so that instantiation produces a partial function, and supply the model parameters later at runtime:

optimizer_fn:
  _target_: torch.optim.AdamW
  _partial_: true
  lr: 8e-4
  weight_decay: 1e-3

# later in the runner
optimizer = optimizer_fn(model.parameters())
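As an end-to-end sketch (the model and config here are stand-ins, not repo code), Hydra's _partial_ flag makes instantiate return a functools.partial, which the runner then calls with the runtime-only arguments:

import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "optimizer_fn": {
            "_target_": "torch.optim.AdamW",
            "_partial_": True,
            "lr": 8e-4,
            "weight_decay": 1e-3,
        }
    }
)
model = torch.nn.Linear(4, 4)                 # stand-in model
optimizer_fn = instantiate(cfg.optimizer_fn)  # functools.partial(AdamW, lr=8e-4, weight_decay=1e-3)
optimizer = optimizer_fn(model.parameters())  # the runtime-only argument is supplied here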
Training UMA¶
The UMA model is completely defined here. It was also called “escn_md” during internal development, since it is based on the eSEN architecture.
Training, evaluation, and inference are all defined in the mlip unit.
To train a model, we need to initialize a TrainRunner with an MLIPTrainEvalUnit.
Due to the complexity of UMA, and of training a multi-architecture, multi-dataset, multi-task model, we leverage Hydra's config group syntax to organize UMA training into the following sections:
backbone - selects the specific backbone architecture (e.g., uma-sm, uma-md, uma-large)
cluster - quickly switch settings between different SLURM clusters or a local environment
dataset - select the dataset to train on
element_refs - select the element references
tasks - select the task set (e.g., for direct or conservative training)
Example Commands¶
Get training started locally using local settings and the debug dataset:
fairchem -c configs/uma/training_release/uma_sm_direct_pretrain.yaml cluster=h100_local dataset=uma_debug

Train UMA conservative with 16 nodes on SLURM:
fairchem -c configs/uma/training_release/uma_sm_conserve_finetune.yaml cluster=h100 job.scheduler.num_nodes=16 run_name="uma_conserve_train"