Training Models from Scratch

This repo is used to train large, state-of-the-art graph neural networks from scratch on datasets such as OC20, OMol25, and OMat24.

FAIRChem Training Framework Overview

The FAIRChem training framework currently uses a simple SPMD (Single Program Multiple Data) paradigm. It is made of several components:

  1. User CLI and Launcher - The fairchem CLI can run jobs locally using torch distributed elastic or on SLURM. More environments may be supported in the future.

  2. Configuration - We strictly use Hydra YAMLs for configuration.

  3. Runner Interface - The core program code that is replicated to run on all ranks; an optional Reducer is also available for evaluation jobs. Runners are distinct user functions that run on a single rank (i.e., a single GPU) and describe separate high-level tasks such as Train, Eval, Predict, Relaxations, MD, etc. Anyone can write a new runner if its functionality is sufficiently different from the existing ones.

  4. Trainer - We use TorchTNT as a lightweight training loop, which allows us to cleanly separate data loading from the training loop (see the sketch below).
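
As an illustration only (this is not FAIRChem's actual train unit; ToyUnit and its inline data are invented for the sketch), a minimal TorchTNT train unit looks like this:

import torch
from torchtnt.framework.state import State
from torchtnt.framework.train import train
from torchtnt.framework.unit import TrainUnit

Batch = tuple[torch.Tensor, torch.Tensor]

class ToyUnit(TrainUnit[Batch]):
    # A toy unit: one linear layer trained with an MSE loss.
    def __init__(self) -> None:
        super().__init__()
        self.model = torch.nn.Linear(8, 1)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=8e-4)

    # TorchTNT calls train_step once per batch; the loop itself lives in the framework.
    def train_step(self, state: State, data: Batch) -> None:
        x, y = data
        loss = torch.nn.functional.mse_loss(self.model(x), y)
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()

# Any iterable of batches can serve as the dataloader, which is what keeps
# data loading separate from the training loop.
train(ToyUnit(), [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)], max_epochs=1)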

FAIRChem v2 CLI

FAIRChem uses a single CLI for running jobs. It accepts a single argument, the location of the Hydra YAML.

The CLI can launch jobs either locally (using torch distributed elastic) or on SLURM.
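
For example, to launch a job from a config (the path here is illustrative):

fairchem -c path/to/config.yaml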

FAIRChem v2 Config Structure

A FAIRChem config is composed of only two valid top-level keys: job (Job Config) and runner (Runner Config). Additionally, you can add key/value pairs that are consumed by the OmegaConf interpolation syntax to fill in fields (such as the ${cluster.*} references below); no other top-level keys are permitted.
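
Schematically, a config therefore looks like this (the _target_ path is a placeholder, not a real fairchem class path):

job:
  run_name: minimal_example
  # device, scheduler, and other job-level settings go here
runner:
  _target_: my_package.MyRunner # placeholder: any Runner implementation (see above)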

Example Configurations

Local run:

job:
  device_type: CUDA
  scheduler:
    mode: LOCAL
    ranks_per_node: 4
  run_name: local_training_run

SLURM run:

job:
  device_type: CUDA
  scheduler:
    mode: SLURM
    ranks_per_node: 8
    num_nodes: 4
    slurm:
      account: ${cluster.account}
      qos: ${cluster.qos}
      mem_gb: ${cluster.mem_gb}
      cpus_per_task: ${cluster.cpus_per_task}
  run_dir: /path/to/output
  run_name: slurm_run_example
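
The ${cluster.*} fields above are resolved by OmegaConf interpolation. In the released UMA configs they are supplied by the cluster config group selected on the command line (e.g. cluster=h100 in the example commands below); conceptually, the resolved values amount to an extra top-level block like this (values are hypothetical):

cluster:
  account: my_slurm_account
  qos: normal
  mem_gb: 480
  cpus_per_task: 16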

Config Object Instantiation

To keep our configs explicit (configs should be thought of as an extension of code), we use the Hydra instantiation framework throughout: every config is fully described by a corresponding Python class and is never a standalone dictionary.
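
For example, a block with a _target_ key is turned into the corresponding Python object by Hydra's instantiate (torch.nn.Linear is used here purely for illustration):

from hydra.utils import instantiate
from omegaconf import OmegaConf

# _target_ names the class; the remaining keys become constructor arguments.
cfg = OmegaConf.create({"_target_": "torch.nn.Linear", "in_features": 8, "out_features": 4})
layer = instantiate(cfg)  # equivalent to torch.nn.Linear(in_features=8, out_features=4)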

Runtime Instantiation with Partial Functions

While we want to use static instantiation as much as possible, there are many cases where certain objects require runtime inputs to be created. For example, if we want to create a PyTorch optimizer, we can give it all the arguments except the model parameters, because those are only known at runtime.

optimizer:
  _target_: torch.optim.AdamW
  params: ??? # required, but only known at runtime
  lr: 8e-4
  weight_decay: 1e-3
optimizer_fn:
  _target_: torch.optim.AdamW
  _partial_: true
  lr: 8e-4
  weight_decay: 1e-3

# later, in the runner:
optimizer = optimizer_fn(model.parameters())
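
In code, instantiating the _partial_ config yields a functools.partial that is completed at runtime. A minimal sketch (the config is built inline here instead of being loaded from YAML):

import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "optimizer_fn": {
        "_target_": "torch.optim.AdamW",
        "_partial_": True,
        "lr": 8e-4,
        "weight_decay": 1e-3,
    }
})

model = torch.nn.Linear(8, 4)  # stand-in for the real model
optimizer_fn = instantiate(cfg.optimizer_fn)  # functools.partial(AdamW, lr=8e-4, weight_decay=1e-3)
optimizer = optimizer_fn(model.parameters())  # supply the runtime-only argument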

Training UMA

The UMA model is completely defined in this repo; it was also called “escn_md” during internal development, since it is based on the eSEN architecture.

Training, evaluation, and inference are all defined in the mlip unit.

To train a model, we initialize a TrainRunner with an MLIPTrainEvalUnit.
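
Schematically, the runner block wires the two together like this; the _target_ paths and the argument name are placeholders, not the actual fairchem API (the released UMA configs contain the real paths):

runner:
  _target_: path.to.TrainRunner # placeholder module path
  train_eval_unit:
    _target_: path.to.MLIPTrainEvalUnit # placeholder module path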

Because UMA is a complex multi-architecture, multi-dataset, multi-task model, we leverage Hydra's config group syntax to organize UMA training into config groups (such as cluster and dataset) that are selected on the command line, as in the example commands below.

Example Commands

Start training locally using the local cluster settings and the debug dataset:

fairchem -c configs/uma/training_release/uma_sm_direct_pretrain.yaml cluster=h100_local dataset=uma_debug

Train the conservative UMA model on 16 SLURM nodes:

fairchem -c configs/uma/training_release/uma_sm_conserve_finetune.yaml cluster=h100 job.scheduler.num_nodes=16 run_name="uma_conserve_train"