This repo is used to train large, state-of-the-art graph neural networks from scratch on datasets such as OC20, OMol25, and OMat24, among others.
FAIRChem Training Framework Overview¶
The FAIRChem training framework currently uses a simple SPMD (Single Program, Multiple Data) paradigm. It consists of the following components:
User CLI and Launcher - The fairchem CLI can run jobs locally using torch distributed elastic or on SLURM. More environments may be supported in the future.
Configuration - We strictly use Hydra YAMLs for configuration.
Runner Interface - The core program code that is replicated to run on all ranks. An optional Reducer is also available for evaluation jobs. Runners are distinct user functions that run on a single rank (i.e., GPU). They describe separate high-level tasks such as Train, Eval, Predict, Relaxations, MD, etc. Anyone can write a new runner if its functionality is sufficiently different from the ones that already exist (see the sketch after this list).
Trainer - We use TorchTNT as a lightweight training loop. This allows us to cleanly separate data loading from the training loop.
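As an illustrative sketch (not code from the repo), a new runner subclasses the Runner base class and implements its run() method; the exact abstract interface (for example, any checkpoint save/load hooks) should be checked in fairchem.core.components.runner, and the x/y arguments here are placeholders that mirror the config examples later on this page.

from fairchem.core.components.runner import Runner

class MyRunner(Runner):
    """Hypothetical runner; every rank executes run() independently (SPMD)."""

    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

    def run(self):
        # the single-rank work goes here (a training loop, an eval pass, MD, ...)
        return self.x + self.y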
FAIRChem v2 CLI¶
FAIRChem uses a single CLI for running jobs. It accepts a single argument, the location of the Hydra YAML.
The CLI can launch jobs locally using torch distributed elastic OR on SLURM.
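For example, assuming a config saved at a hypothetical path:

fairchem -c /path/to/my_config.yaml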
FAIRChem v2 Config Structure¶
A FAIRChem config is composed of only two valid top-level keys: job (Job Config) and runner (Runner Config). Additionally, you can add top-level key/value pairs that are referenced elsewhere in the config via OmegaConf interpolation syntax. No other top-level keys are permitted.
JobConfig represents configuration parameters that describe the overall job (mostly infrastructure parameters), such as the number of nodes, log locations, and loggers. This is a structured config and must strictly adhere to the JobConfig class.
Runner Config describes the user code. This part of the config is recursively instantiated at the start of a job using the Hydra instantiation framework.
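As a rough illustration of the interpolation point above (all keys and values here are made up), an extra top-level key such as cluster can be referenced from inside the job config and is resolved by OmegaConf when the value is accessed:

from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "cluster": {"account": "my_account"},  # extra key used only as an interpolation source
        "job": {"scheduler": {"slurm": {"account": "${cluster.account}"}}},
    }
)
print(cfg.job.scheduler.slurm.account)  # -> "my_account"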
Example Configurations¶
Local run:
job:
  device_type: CUDA
  scheduler:
    mode: LOCAL
    ranks_per_node: 4
  run_name: local_training_run

SLURM run:
job:
  device_type: CUDA
  scheduler:
    mode: SLURM
    ranks_per_node: 8
    num_nodes: 4
    slurm:
      account: ${cluster.account}
      qos: ${cluster.qos}
      mem_gb: ${cluster.mem_gb}
      cpus_per_task: ${cluster.cpus_per_task}
  run_dir: /path/to/output
  run_name: slurm_run_example

Config Object Instantiation¶
To keep our configs explicit (configs should be thought of as an extension of code), we prefer to use the Hydra instantiation framework throughout; the config is always fully described by a corresponding Python class and should never be a standalone dictionary.
Good vs Bad Config Patterns
Bad pattern - We have no idea where to find the code that uses runner or where variables x and y are actually used:
runner:
  x: 5
  y: 6

Good pattern - Now we know which class runner corresponds to and that x, y are just initializer variables of runner. If we need to check the definition or understand the code, we can simply go to runner.py:
runner:
  _target_: fairchem.core.components.runner.Runner
  x: 5
  y: 6
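To make the correspondence concrete, the following sketch (not code from the repo; in practice this happens inside the launcher) shows roughly what Hydra's instantiate does with the block above:

from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "runner": {
            "_target_": "fairchem.core.components.runner.Runner",
            "x": 5,
            "y": 6,
        }
    }
)
runner = instantiate(cfg.runner)  # roughly equivalent to Runner(x=5, y=6)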
Runtime Instantiation with Partial Functions¶
While we want to use static instantiation as much as possible, there are many cases where certain objects require runtime inputs to be created. For example, if we want to create a PyTorch optimizer, we can give it all of its arguments except the model parameters (because they are only known at runtime).
optimizer:
  _target_: torch.optim.AdamW
  params: ?? # this is only known at runtime
  lr: 8e-4
  weight_decay: 1e-3

Instead, we set Hydra's _partial_ flag so that instantiation produces a partial function, and supply the model parameters later at runtime:

optimizer_fn:
  _target_: torch.optim.AdamW
  _partial_: true
  lr: 8e-4
  weight_decay: 1e-3

# later in the runner
optimizer = optimizer_fn(model.parameters())
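As an end-to-end sketch (the model and config here are stand-ins, not repo code), Hydra's _partial_ flag makes instantiate return a functools.partial, which the runner then calls with the runtime-only arguments:

import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "optimizer_fn": {
            "_target_": "torch.optim.AdamW",
            "_partial_": True,
            "lr": 8e-4,
            "weight_decay": 1e-3,
        }
    }
)
model = torch.nn.Linear(4, 4)                 # stand-in model
optimizer_fn = instantiate(cfg.optimizer_fn)  # functools.partial(AdamW, lr=8e-4, weight_decay=1e-3)
optimizer = optimizer_fn(model.parameters())  # the runtime-only argument is supplied here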
Training UMA¶
The UMA model is completely defined here. It was also called “escn_md” during internal development, since it is based on the eSEN architecture.
Training, evaluation, and inference are all defined in the mlip unit.
To train a model, we need to initialize a TrainRunner with an MLIPTrainEvalUnit.
Due to the complexity of UMA, and of training a multi-architecture, multi-dataset, multi-task model, we leverage Hydra's config group syntax to organize UMA training into the following sections:
backbone - selects the specific backbone architecture (e.g., uma-sm, uma-md, uma-large)
cluster - quickly switch settings between different SLURM clusters or a local environment
dataset - select the dataset to train on
element_refs - select the element references
tasks - select the task set (e.g., for direct or conservative training)
Example Commands¶
Get training started locally using local settings and the debug dataset:
fairchem -c configs/uma/training_release/uma_sm_direct_pretrain.yaml cluster=h100_local dataset=uma_debug

Train UMA conservative with 16 nodes on SLURM:
fairchem -c configs/uma/training_release/uma_sm_conserve_finetune.yaml cluster=h100 job.scheduler.num_nodes=16 run_name="uma_conserve_train"