core.launchers.ray_on_slurm_launch#

Classes#

SPMDWorker

SPMDController

Represents an abstraction over things that run in a loop and can save/load state.

Functions#

ray_entrypoint(runner_config)

ray_on_slurm_launch(config, log_dir)

Module Contents#

class core.launchers.ray_on_slurm_launch.SPMDWorker(job_config: omegaconf.DictConfig, runner_config: omegaconf.DictConfig, worker_id: int, world_size: int, device: str, gp_size: int | None = None, master_addr: str | None = None, master_port: int | None = None)#
runner_config#
master_address#
master_port#
worker_id#
device#
gp_size#
world_size#
job_config#
distributed_setup = False#
_distributed_setup(worker_id: int, world_size: int, master_address: str, master_port: int, device: str, gp_size: int | None)#
get_master_address_and_port()#
run()#
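
The get_master_address_and_port helper is not documented here; a common way for the rank-0 host to pick a rendezvous address and a free TCP port is to bind a throwaway socket to port 0 and let the OS choose. The sketch below is a hedged, stdlib-only stand-in, not fairchem's actual implementation:

```python
import socket


def get_master_address_and_port() -> tuple[str, int]:
    """Pick this host's address and an OS-assigned free TCP port.

    Hypothetical stand-in for SPMDWorker.get_master_address_and_port;
    the real implementation may resolve the address differently.
    """
    master_address = socket.gethostbyname(socket.gethostname())
    # Binding to port 0 asks the OS for a free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        master_port = s.getsockname()[1]
    return master_address, master_port
```

Once chosen, the address and port are passed to every worker so all ranks rendezvous at the same endpoint.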
class core.launchers.ray_on_slurm_launch.SPMDController(job_config: omegaconf.DictConfig, runner_config: omegaconf.DictConfig)#

Bases: fairchem.core.components.runner.Runner

Represents an abstraction over things that run in a loop and can save/load state.

i.e., Trainers, Validators, and Relaxations all fall into this category.

Note

When running with the fairchemv2 CLI, the job_config attribute is set at runtime to the one given in the config file.

job_config#

a managed attribute that gives access to the job config

Type:

DictConfig

job_config#
runner_config#
device#
world_size#
gp_group_size#
ranks_per_node#
num_nodes#
workers#
run()#
save_state(checkpoint_location: str, is_preemption: bool = False) → bool#
load_state(checkpoint_location: str | None) → None#
core.launchers.ray_on_slurm_launch.ray_entrypoint(runner_config: omegaconf.DictConfig)#
core.launchers.ray_on_slurm_launch.ray_on_slurm_launch(config: omegaconf.DictConfig, log_dir: str)#
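
The classes above follow the standard SPMD pattern: the controller picks a rendezvous endpoint, spawns world_size workers, and each worker exports the conventional torch.distributed environment variables before initializing its process group. A stdlib-only sketch of that wiring, with names and flow that are assumptions rather than fairchem's actual code:

```python
import os


def setup_worker_env(worker_id: int, world_size: int,
                     master_address: str, master_port: int) -> None:
    """Export the rendezvous variables torch.distributed conventionally reads.

    Hypothetical stand-in for part of SPMDWorker._distributed_setup; the
    real method also handles device placement and graph-parallel (gp) groups.
    """
    os.environ["MASTER_ADDR"] = master_address
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(worker_id)
    os.environ["WORLD_SIZE"] = str(world_size)
    # With these set, torch.distributed.init_process_group(backend=...)
    # can rendezvous via the default env:// initialization method.
```

Each Ray worker would call something like this exactly once before any collective communication, which is why SPMDWorker tracks a distributed_setup flag.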