core.launchers.ray_on_slurm_launch#
Classes#
- SPMDWorker
- SPMDController: Represents an abstraction over things that run in a loop and can save/load state.
Functions#
- ray_entrypoint
- ray_on_slurm_launch
Module Contents#
- class core.launchers.ray_on_slurm_launch.SPMDWorker(job_config: omegaconf.DictConfig, runner_config: omegaconf.DictConfig, worker_id: int, world_size: int, device: str, gp_size: int | None = None, master_addr: str | None = None, master_port: int | None = None)#
- runner_config#
- master_address#
- master_port#
- worker_id#
- device#
- gp_size#
- world_size#
- job_config#
- distributed_setup = False#
- _distributed_setup(worker_id: int, world_size: int, master_address: str, master_port: int, device: str, gp_size: int | None)#
- get_master_address_and_port()#
- run()#
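For orientation, the sketch below shows the rendezvous pattern a worker like this typically follows during distributed setup: export `MASTER_ADDR`/`MASTER_PORT` and initialize the default process group with `torch.distributed`. This is a minimal, hypothetical illustration of the pattern, not a reproduction of `SPMDWorker._distributed_setup`; the backend selection is an assumption.

```python
# Illustrative rendezvous sketch; does NOT reproduce SPMDWorker._distributed_setup.
import os

import torch.distributed as dist


def distributed_setup(worker_id: int, world_size: int,
                      master_address: str, master_port: int, device: str) -> None:
    # torch.distributed reads the rendezvous point from these env vars
    os.environ["MASTER_ADDR"] = master_address
    os.environ["MASTER_PORT"] = str(master_port)
    # Assumed backend choice: NCCL for GPUs, Gloo otherwise
    backend = "nccl" if device == "cuda" else "gloo"
    dist.init_process_group(backend=backend, rank=worker_id, world_size=world_size)


if __name__ == "__main__":
    # Single-process example: rank 0 of a world of size 1 on CPU
    distributed_setup(0, 1, "127.0.0.1", 29500, "cpu")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
    dist.destroy_process_group()
```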
- class core.launchers.ray_on_slurm_launch.SPMDController(job_config: omegaconf.DictConfig, runner_config: omegaconf.DictConfig)#
Bases: fairchem.core.components.runner.Runner
Represents an abstraction over things that run in a loop and can save/load state.
i.e., Trainers, Validators, and Relaxation all fall in this category.
Note
When running with the fairchemv2 CLI, the job_config attribute is set at runtime to the one given in the config file.
- job_config#
a managed attribute that gives access to the job config
Type: DictConfig
- job_config#
- runner_config#
- device#
- world_size#
- gp_group_size#
- ranks_per_node#
- num_nodes#
- workers#
- run()#
- save_state(checkpoint_location: str, is_preemption: bool = False) → bool#
- load_state(checkpoint_location: str | None) → None#
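The save_state/load_state pair is the checkpoint contract inherited from Runner. The toy class below is a hypothetical illustration of that contract, not fairchem code: save_state returns True when the checkpoint was written, and load_state tolerates a None location (fresh start).

```python
# Hypothetical toy runner illustrating the Runner save/load contract.
import os


class CountingRunner:
    """Counts steps in run() and can checkpoint/restore that count."""

    def __init__(self) -> None:
        self.step = 0

    def run(self) -> None:
        for _ in range(10):
            self.step += 1

    def save_state(self, checkpoint_location: str, is_preemption: bool = False) -> bool:
        os.makedirs(checkpoint_location, exist_ok=True)
        with open(os.path.join(checkpoint_location, "step.txt"), "w") as f:
            f.write(str(self.step))
        return True  # True signals the checkpoint was written successfully

    def load_state(self, checkpoint_location: str | None) -> None:
        if checkpoint_location is None:
            return  # nothing to restore; start fresh
        with open(os.path.join(checkpoint_location, "step.txt")) as f:
            self.step = int(f.read())
```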
- core.launchers.ray_on_slurm_launch.ray_entrypoint(runner_config: omegaconf.DictConfig)#
- core.launchers.ray_on_slurm_launch.ray_on_slurm_launch(config: omegaconf.DictConfig, log_dir: str)#
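A sketch of invoking ray_on_slurm_launch programmatically with an OmegaConf config, matching the signature above. The config keys shown ("runner", "job", and their contents) are assumptions for illustration only; consult the project's config files for the real schema, and note the import only resolves inside the fairchem source tree.

```python
# Illustrative call pattern; the config schema below is assumed, not documented.
from omegaconf import OmegaConf

from core.launchers.ray_on_slurm_launch import ray_on_slurm_launch

config = OmegaConf.create(
    {
        # Hypothetical keys for illustration; the real schema may differ.
        "runner": {"_target_": "my_project.runners.MyRunner"},
        "job": {"num_nodes": 1, "ranks_per_node": 8},
    }
)

ray_on_slurm_launch(config, log_dir="/tmp/ray_launch_logs")
```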