core.common.distutils#

Copyright (c) Meta Platforms, Inc. and affiliates.

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Attributes#

T

DISTRIBUTED_PORT

CURRENT_DEVICE_TYPE_STR

Functions#

os_environ_get_or_throw(→ str)

get_init_method(init_method, world_size[, rank, ...])

Get the initialization method for a distributed job based on the specified method type.

setup(→ None)

cleanup(→ None)

initialized(→ bool)

get_rank(→ int)

get_world_size(→ int)

is_master(→ bool)

synchronize(→ None)

broadcast(→ None)

broadcast_object_list(→ None)

all_reduce(→ torch.Tensor)

all_gather(→ list[torch.Tensor])

gather_objects(→ list[T])

Gather a list of pickleable objects onto rank 0.

assign_device_for_local_rank(→ None)

get_device_for_local_rank(→ str)

setup_env_local()

Module Contents#

core.common.distutils.T#
core.common.distutils.DISTRIBUTED_PORT = 13356#
core.common.distutils.CURRENT_DEVICE_TYPE_STR = 'CURRRENT_DEVICE_TYPE'#
core.common.distutils.os_environ_get_or_throw(x: str) → str#
core.common.distutils.get_init_method(init_method, world_size: int | None, rank: int | None = None, node_list: str | None = None, filename: str | None = None)#

Get the initialization method for a distributed job based on the specified method type.

Parameters:
  • init_method – The initialization method type, either “tcp” or “file”.

  • world_size – The total number of processes in the distributed job.

  • rank – The rank of the current process (optional).

  • node_list – The list of nodes for SLURM-based distributed job (optional, used with “tcp”).

  • filename – The shared file path for file-based initialization (optional, used with “file”).

Returns:

The initialization method string to be used by PyTorch’s distributed module.

Raises:

ValueError – If an invalid init_method is provided.
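
A minimal usage sketch for illustration only: the import path follows the documented module name (it may differ in an installed distribution), and the node list and file path values below are hypothetical.

    from core.common import distutils

    # TCP initialization: derive the rendezvous address from a SLURM node list.
    tcp_init = distutils.get_init_method(
        "tcp",
        world_size=4,
        rank=0,
        node_list="node[001-004]",  # hypothetical SLURM_JOB_NODELIST value
    )

    # File-based initialization: all ranks point at the same shared file.
    file_init = distutils.get_init_method(
        "file",
        world_size=4,
        filename="/shared/tmp/ddp_init",  # hypothetical shared filesystem path
    )

    # Either string is an init_method suitable for PyTorch's distributed module.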

core.common.distutils.setup(config) → None#
core.common.distutils.cleanup() → None#
core.common.distutils.initialized() → bool#
core.common.distutils.get_rank() → int#
core.common.distutils.get_world_size() → int#
core.common.distutils.is_master() → bool#
core.common.distutils.synchronize() → None#
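
A minimal sketch of the usual rank/world-size pattern built from the helpers above, assuming setup(config) (or an equivalent call) has already initialized the process group; the import path follows the documented module name.

    from core.common import distutils

    if distutils.initialized():
        rank = distutils.get_rank()
        world_size = distutils.get_world_size()

        # Restrict logging or checkpoint writes to the master process.
        if distutils.is_master():
            print(f"distributed run with {world_size} processes")

        # Block until every rank reaches this point.
        distutils.synchronize()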
core.common.distutils.broadcast(tensor: torch.Tensor, src, group=dist.group.WORLD, async_op: bool = False) → None#
core.common.distutils.broadcast_object_list(object_list: list[Any], src: int, group=dist.group.WORLD, device: str | None = None) → None#
core.common.distutils.all_reduce(data, group=dist.group.WORLD, average: bool = False, device=None) → torch.Tensor#
core.common.distutils.all_gather(data, group=dist.group.WORLD, device=None) → list[torch.Tensor]#
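
A sketch of the tensor collectives, assuming the process group is already initialized; the shapes and values are illustrative only.

    import torch
    from core.common import distutils

    # Average a per-rank scalar metric across all processes.
    local_loss = torch.tensor(0.25)
    mean_loss = distutils.all_reduce(local_loss, average=True)

    # Collect every rank's tensor into a list ordered by rank.
    local_pred = torch.randn(8)
    all_preds = distutils.all_gather(local_pred)  # list[torch.Tensor], one per rank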
core.common.distutils.gather_objects(data: T, group: torch.distributed.ProcessGroup = dist.group.WORLD) → list[T]#

Gather a list of pickleable objects onto rank 0.
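
A sketch of gathering arbitrary picklable objects onto the master rank; per the docstring, the gathered list is populated on rank 0, and the import path follows the documented module name.

    from core.common import distutils

    # Each rank contributes one picklable object.
    local_summary = {"rank": distutils.get_rank(), "n_samples": 128}
    gathered = distutils.gather_objects(local_summary)

    if distutils.is_master():
        # On rank 0, `gathered` holds one entry per rank.
        print(f"received {len(gathered)} summaries")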

core.common.distutils.assign_device_for_local_rank(cpu: bool, local_rank: int) → None#
core.common.distutils.get_device_for_local_rank() → str#
core.common.distutils.setup_env_local()#
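
A sketch of the device-assignment helpers; the LOCAL_RANK environment variable is assumed to be set by the launcher (e.g. torchrun), and the returned device string (e.g. "cuda:0") depends on the setup.

    import os

    import torch
    from core.common import distutils

    # Pin this process to a device chosen from its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    distutils.assign_device_for_local_rank(cpu=False, local_rank=local_rank)

    # Subsequent code can query the assigned device string.
    device = distutils.get_device_for_local_rank()
    x = torch.zeros(4, device=device)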