core.launchers.ray_on_slurm_launch#

Classes#

SPMDWorker

SPMDController

Represents an abstraction over things that run in a loop and can save/load state.

Functions#

ray_entrypoint(runner_config)

ray_on_slurm_launch(config, log_dir)

Module Contents#

class core.launchers.ray_on_slurm_launch.SPMDWorker(job_config: omegaconf.DictConfig, runner_config: omegaconf.DictConfig, worker_id: int, world_size: int, device: str, gp_size: int | None = None, master_addr: str | None = None, master_port: int | None = None)#
runner_config#
master_address#
master_port#
worker_id#
device#
gp_size#
world_size#
job_config#
distributed_setup = False#
_distributed_setup(worker_id: int, world_size: int, master_address: str, master_port: int, device: str, gp_size: int | None)#
get_master_address_and_port()#
run()#
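
The get_master_address_and_port helper is not documented here; a common way for the rank-0 host to pick a rendezvous address and a free TCP port is to bind a throwaway socket to port 0 and let the OS choose. The sketch below is a hedged, stdlib-only stand-in, not fairchem's actual implementation:

```python
import socket


def get_master_address_and_port() -> tuple[str, int]:
    """Pick this host's address and an OS-assigned free TCP port.

    Hypothetical stand-in for SPMDWorker.get_master_address_and_port;
    the real implementation may resolve the address differently.
    """
    master_address = socket.gethostbyname(socket.gethostname())
    # Binding to port 0 asks the OS for a free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        master_port = s.getsockname()[1]
    return master_address, master_port
```

Once chosen, the address and port are passed to every worker so all ranks rendezvous at the same endpoint.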
class core.launchers.ray_on_slurm_launch.SPMDController(job_config: omegaconf.DictConfig, runner_config: omegaconf.DictConfig)#

Bases: fairchem.core.components.runner.Runner

Represents an abstraction over things that run in a loop and can save/load state.

i.e., Trainers, Validators, and Relaxations all fall into this category.

Note

When running with the fairchemv2 CLI, the job_config attribute is set at runtime to the one given in the config file.

job_config#

a managed attribute that gives access to the job config

Type:

DictConfig

job_config#
runner_config#
device#
world_size#
gp_group_size#
ranks_per_node#
num_nodes#
workers#
run()#
save_state(checkpoint_location: str, is_preemption: bool = False) → bool#
load_state(checkpoint_location: str | None) → None#
core.launchers.ray_on_slurm_launch.ray_entrypoint(runner_config: omegaconf.DictConfig)#
core.launchers.ray_on_slurm_launch.ray_on_slurm_launch(config: omegaconf.DictConfig, log_dir: str)#
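
The classes above follow the standard SPMD pattern: the controller picks a rendezvous endpoint, spawns world_size workers, and each worker exports the conventional torch.distributed environment variables before initializing its process group. A stdlib-only sketch of that wiring, with names and flow that are assumptions rather than fairchem's actual code:

```python
import os


def setup_worker_env(worker_id: int, world_size: int,
                     master_address: str, master_port: int) -> None:
    """Export the rendezvous variables torch.distributed conventionally reads.

    Hypothetical stand-in for part of SPMDWorker._distributed_setup; the
    real method also handles device placement and graph-parallel (gp) groups.
    """
    os.environ["MASTER_ADDR"] = master_address
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(worker_id)
    os.environ["WORLD_SIZE"] = str(world_size)
    # With these set, torch.distributed.init_process_group(backend=...)
    # can rendezvous via the default env:// initialization method.
```

Each Ray worker would call something like this exactly once before any collective communication, which is why SPMDWorker tracks a distributed_setup flag.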