core.launchers.cluster.ray_cluster#
Attributes#
- `start_ip_pattern`
- `PayloadReturnT`
Classes#
- `HeadInfo` – Information about the head node that we can share with workers.
- `RayClusterState` – Manages the state of the Ray cluster: keeps track of the head node and the workers, and makes sure they are all ready before starting the payload.
- `RayCluster` – Tools to start a Ray cluster (head and workers) on slurm with the correct settings.
Functions#
- `kill_proc_tree`
- `find_free_port`
- `scancel` – Cancel the SLURM jobs with the given job IDs.
- `mk_symlinks` – Create symlinks for the job's stdout and stderr in the target directory with nicer names.
- `_ray_head_script` – Start the head node of the Ray cluster on slurm.
- `worker_script` – Start an array of worker nodes for the Ray cluster on slurm, waiting on the head node first.
Module Contents#
- core.launchers.cluster.ray_cluster.kill_proc_tree(pid, including_parent=True)#
- core.launchers.cluster.ray_cluster.find_free_port()#
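Neither helper has a docstring. Going only by their names and signatures, here is a minimal sketch of how such helpers are commonly implemented (illustrative only, not this module's actual code; it assumes `psutil` is available):

```python
import socket

import psutil  # assumption: process-tree helpers like this usually rely on psutil


def find_free_port_sketch() -> int:
    # Bind to port 0 so the OS assigns an unused port, then report it back.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def kill_proc_tree_sketch(pid: int, including_parent: bool = True) -> None:
    # Kill all descendants of pid first, then (optionally) pid itself.
    parent = psutil.Process(pid)
    for child in parent.children(recursive=True):
        child.kill()
    if including_parent:
        parent.kill()
```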
- core.launchers.cluster.ray_cluster.scancel(job_ids: list[str])#
Cancel the SLURM jobs with the given job IDs.
- Parameters:
job_ids (List[str]) – A list of job IDs to cancel.
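A one-line usage example (the job IDs are placeholders):

```python
from core.launchers.cluster.ray_cluster import scancel

scancel(["123456", "123457"])  # cancels both SLURM jobs
```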
- core.launchers.cluster.ray_cluster.start_ip_pattern = "ray start --address='([0-9\\.]+):([0-9]+)'"#
- core.launchers.cluster.ray_cluster.PayloadReturnT#
- core.launchers.cluster.ray_cluster.mk_symlinks(target_dir: pathlib.Path, job_type: str, paths: submitit.core.utils.JobPaths)#
Create symlinks for the job's stdout and stderr in the target directory with nicer names.
- class core.launchers.cluster.ray_cluster.HeadInfo#
Information about the head node that we can share with workers.
- hostname: str | None = None#
- port: int | None = None#
- temp_dir: str | None = None#
- class core.launchers.cluster.ray_cluster.RayClusterState(rdv_dir: pathlib.Path | None = None, cluster_id: str | None = None)#
This class is responsible for managing the state of the Ray cluster. It is useful to keep track of the head node and the workers, and to make sure they are all ready before starting the payload.
It relies on storing info in a rendezvous directory so the info can be shared asynchronously between jobs. (A usage sketch follows the member list below.)
- Parameters:
rdv_dir (Path) – The directory where the rendezvous information will be stored. Defaults to ~/.fairray.
cluster_id (str) – A unique identifier for the cluster. Defaults to a random UUID. Set this only if you want to connect to an existing cluster.
- rendezvous_rootdir#
- _cluster_id#
- property cluster_id: str#
Returns the unique identifier for the cluster.
- property rendezvous_dir: pathlib.Path#
Returns the path to the directory where the rendezvous information is stored.
- property jobs_dir: pathlib.Path#
Returns the path to the directory where job information is stored.
- property _head_json: pathlib.Path#
Returns the path to the JSON file containing head node information.
- is_head_ready() bool#
Checks if the head node information is available and ready.
- head_info() HeadInfo | None#
Retrieves the head node information from the stored JSON file.
- Returns:
The head node information if available, otherwise None.
- Return type:
Optional[HeadInfo]
- save_head_info(head_info: HeadInfo)#
Saves the head node information to a JSON file.
- Parameters:
head_info (HeadInfo) – The head node information to save.
- clean()#
Removes the rendezvous directory and all its contents.
- add_job(job: submitit.Job)#
Adds a job to the jobs directory by creating a JSON file with the job’s information.
- Parameters:
job (submitit.Job) – The job to add.
- list_job_ids() list[str]#
Lists all job IDs stored in the jobs directory.
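A short sketch of using the documented RayClusterState API to reattach to an existing cluster; the cluster id is a placeholder, and rdv_dir can be omitted to use the ~/.fairray default:

```python
from pathlib import Path

from core.launchers.cluster.ray_cluster import RayClusterState

state = RayClusterState(rdv_dir=Path.home() / ".fairray", cluster_id="my-cluster")

if state.is_head_ready():
    info = state.head_info()  # HeadInfo with hostname/port/temp_dir
    print(f"head node at {info.hostname}:{info.port}")

print(state.list_job_ids())  # ids of jobs registered via add_job()
```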
- core.launchers.cluster.ray_cluster._ray_head_script(cluster_state: RayClusterState, worker_wait_timeout_seconds: int, payload: Callable[Ellipsis, PayloadReturnT] | None = None, **kwargs)#
Start the head node of the Ray cluster on slurm.
- core.launchers.cluster.ray_cluster.worker_script(cluster_state: RayClusterState, worker_wait_timeout_seconds: int, start_wait_time_seconds: int = 60)#
Start an array of worker nodes for the Ray cluster on slurm, waiting on the head node first.
- class core.launchers.cluster.ray_cluster.RayCluster(log_dir: pathlib.Path = Path('raycluster_logs'), rdv_dir: pathlib.Path | None = None, cluster_id: str | None = None, worker_wait_timeout_seconds: int = 60)#
A RayCluster offers tools to start a Ray cluster (head and workers) on slurm with the correct settings. (A usage sketch follows at the end of this section.)
- Parameters:
log_dir (Path) – The directory where logs will be stored. Defaults to "raycluster_logs" in the working directory. All slurm logs go there, and symlinks to the stdout/stderr of each job are created with nicer names (head, worker_0, worker_1, …, driver_0, etc.). The interesting logs are in the driver_N.err file; you should tail that.
rdv_dir (Path) – The directory where the rendezvous information will be stored. Defaults to ~/.fairray. Useful if you are trying to recover an existing cluster.
cluster_id (str) – A unique identifier for the cluster. Defaults to a random UUID. Set this only if you want to connect to an existing cluster.
worker_wait_timeout_seconds (int) – The number of seconds Ray will wait for a worker to be ready before giving up. Defaults to 60. If you are scheduling workers in a queue where allocation takes time, you may want to increase this; otherwise your Ray payload will fail because it cannot find resources.
- log_dir: pathlib.Path#
- state: RayClusterState#
- jobs: list[submitit.Job] = []#
- is_shutdown = False#
- num_worker_groups = 0#
- num_drivers = 0#
- head_started = False#
- worker_wait_timeout_seconds#
- start_head(requirements: dict[str, int | str], executor: str = 'slurm', payload: Callable[Ellipsis, PayloadReturnT] | None = None, **kwargs) str#
Start the head node of the Ray cluster on slurm. You should do this first. Interesting requirements: qos, partition, time, gpus, cpus-per-task, mem-per-gpu, etc.
- start_workers(num_workers: int, requirements: dict[str, int | str], executor: str = 'slurm') list[str]#
Start an array of worker nodes of the Ray cluster on slurm. You should do this after starting a head. Interesting requirements: qos, partition, time, gpus, cpus-per-task, mem-per-gpu, etc. You can call this multiple times to start a heterogeneous cluster.
- shutdown()#
Cancel all slurm jobs and remove the rendezvous directory.
- __enter__()#
- __exit__(exc_type, exc_value, traceback)#
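Putting it together, a hedged end-to-end sketch built from the documented methods above. The requirement values and the payload body are placeholders, and it assumes `__enter__` returns the cluster instance:

```python
from core.launchers.cluster.ray_cluster import RayCluster


def payload():
    # Placeholder driver: runs once the cluster is up.
    import ray

    ray.init()
    return ray.cluster_resources()


# Requirement keys follow the "interesting requirements" named above;
# the values here are illustrative.
head_req = {"partition": "dev", "time": 60, "cpus-per-task": 4}
worker_req = {"partition": "dev", "time": 60, "gpus": 1, "cpus-per-task": 8}

with RayCluster(worker_wait_timeout_seconds=300) as cluster:
    cluster.start_head(head_req, payload=payload)  # start the head first
    cluster.start_workers(num_workers=2, requirements=worker_req)
    # On exit, shutdown() cancels the slurm jobs and removes the
    # rendezvous directory.
```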