core.datasets.mt_concat_dataset#

Copyright (c) Meta Platforms, Inc. and affiliates.

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Attributes#

T_co

Classes#

ConcatDataset

An abstract class representing a Dataset.

Functions#

create_concat_dataset(→ ConcatDataset)

Make a concat dataset containing every split of each dataset. Keys have the form {dataset}.{split}.

Module Contents#

core.datasets.mt_concat_dataset.T_co#

class core.datasets.mt_concat_dataset.ConcatDataset(datasets, sampling: dict)#

Bases: torch.utils.data.Dataset[T_co]

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should override __getitem__(), which fetches the data sample for a given key. Subclasses may also override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset. Subclasses may also implement __getitems__() to speed up batched sample loading. This method accepts a list of sample indices for a batch and returns a list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
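The map-style protocol described above can be illustrated with a small sketch. The class name `SquaresDataset` is hypothetical, and the class is duck-typed rather than derived from torch.utils.data.Dataset so that the example runs without PyTorch installed; a real subclass would inherit from Dataset.

```python
class SquaresDataset:
    """Hypothetical map-style dataset: integer index -> squared value."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Size of the dataset, as expected by samplers and DataLoader.
        return self.n

    def __getitem__(self, idx):
        # Fetch a single sample for the given key.
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx * idx

    def __getitems__(self, idxs):
        # Optional batched fetch: a list of indices in, a list of samples out.
        return [self[i] for i in idxs]


ds = SquaresDataset(5)
batch = ds.__getitems__([0, 2, 4])
```

Here `__getitems__` simply loops over `__getitem__`; an implementation backed by a database or file store could replace the loop with a single bulk read.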

static cumsum(sequence, sample_ratios)#
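This page does not show the body of cumsum. A minimal sketch, assuming it returns running totals of each dataset's length scaled by its sampling ratio (the usual pattern for ratio-weighted concatenation), might look like this:

```python
def cumsum(sequence, sample_ratios):
    """Assumed behavior: cumulative sizes after scaling each length by its ratio."""
    totals, running = [], 0
    for dataset, ratio in zip(sequence, sample_ratios):
        # Scaled (virtual) size of this dataset, truncated to an integer.
        scaled = int(ratio * len(dataset))
        running += scaled
        totals.append(running)
    return totals


# Two stand-in datasets of length 10 and 4; the second is expanded 2.5x.
sizes = cumsum([range(10), range(4)], [1.0, 2.5])
```

Under this assumption, the last entry of the result would be the total length reported by `__len__()`.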

datasets = []#

dataset_names = []#

sample_ratios#

cumulative_sizes#

real_sizes#

__len__()#

__getitem__(idx)#

_get_dataset_and_sample_index_list(sample_idxs: list)#

_get_dataset_and_sample_index(idx: int)#
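The lookup from a global index to a (dataset, sample) pair is not documented here. One plausible sketch, assuming `cumulative_sizes` holds the ratio-scaled running totals and `real_sizes` holds the true dataset lengths, uses a binary search and wraps the local index so that oversampled datasets repeat their samples. The free-function form and parameter list are for illustration only:

```python
import bisect


def get_dataset_and_sample_index(idx, cumulative_sizes, real_sizes):
    # Find the first dataset whose cumulative size exceeds idx.
    dataset_idx = bisect.bisect_right(cumulative_sizes, idx)
    # Offset of that dataset's first sample in the global index space.
    offset = cumulative_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
    # Wrap around the real size so expanded datasets reuse their samples.
    sample_idx = (idx - offset) % real_sizes[dataset_idx]
    return dataset_idx, sample_idx


cumulative_sizes = [10, 20]  # second dataset expanded from 4 to 10 virtual samples
real_sizes = [10, 4]
first = get_dataset_and_sample_index(3, cumulative_sizes, real_sizes)
boundary = get_dataset_and_sample_index(10, cumulative_sizes, real_sizes)
wrapped = get_dataset_and_sample_index(17, cumulative_sizes, real_sizes)
```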

property updated_dataset_sizes#

metadata_hasattr(attr) → bool#

get_metadata(attr, sample_idxs_to_get_metadata_for)#

static _dataset_sampling(dataset_sizes: list[int], dataset_names: list[str], sampling: dict) → list[float]#

Return expansion ratios for each dataset based on the sampling strategy.
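The supported sampling strategies are not listed on this page. As a purely illustrative sketch, the function below assumes a `sampling` dict with a `type` key and two hypothetical strategies: `explicit` (ratios given per dataset name under a `ratios` key) and `balanced` (every dataset expanded to the size of the largest). The key names and strategy names are assumptions, not the documented configuration schema.

```python
def dataset_sampling(dataset_sizes, dataset_names, sampling):
    """Hypothetical expansion-ratio computation for two assumed strategies."""
    kind = sampling["type"]
    if kind == "explicit":
        # Ratios supplied directly, keyed by dataset name.
        return [float(sampling["ratios"][name]) for name in dataset_names]
    if kind == "balanced":
        # Expand each dataset to match the largest one.
        largest = max(dataset_sizes)
        return [largest / size for size in dataset_sizes]
    raise NotImplementedError(f"unknown sampling type: {kind}")


names = ["a.train", "b.train"]
balanced = dataset_sampling([10, 4], names, {"type": "balanced"})
explicit = dataset_sampling(
    [10, 4], names, {"type": "explicit", "ratios": {"a.train": 1, "b.train": 3}}
)
```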

core.datasets.mt_concat_dataset.create_concat_dataset(dataset_configs: omegaconf.DictConfig, combined_dataset_config: dict) → ConcatDataset#

Make a concat dataset with all the splits for each dataset. Keys will be {dataset}.{split}.
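The {dataset}.{split} key scheme can be sketched as follows. The config layout (a `splits` mapping under each dataset name) and the helper name `split_keys` are assumptions for illustration; a plain dict stands in for the omegaconf.DictConfig argument.

```python
def split_keys(dataset_configs):
    """Build '{dataset}.{split}' keys from an assumed nested config layout."""
    return [
        f"{dataset}.{split}"
        for dataset, cfg in dataset_configs.items()
        for split in cfg["splits"]
    ]


# Hypothetical config: two datasets, one with two splits.
configs = {
    "dataset_a": {"splits": {"train": {}, "val": {}}},
    "dataset_b": {"splits": {"train": {}}},
}
keys = split_keys(configs)
```

Keys of this form line up with per-dataset entries such as the name-keyed ratios a sampling configuration might carry.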