core.datasets.mt_concat_dataset#
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
Attributes#
Classes#
ConcatDataset | An abstract class representing a Dataset.
Functions#
create_concat_dataset(dataset_configs, combined_dataset_config) | Make a concat dataset with all the splits for each dataset. Keys will be {dataset}.{split}
Module Contents#
- core.datasets.mt_concat_dataset.T_co#
- class core.datasets.mt_concat_dataset.ConcatDataset(datasets, sampling: dict)#
Bases: torch.utils.data.Dataset[T_co]
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should override __getitem__(), which fetches a data sample for a given key. Subclasses may also override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset. Subclasses may also implement __getitems__() to speed up batched sample loading; this method accepts a list of sample indices for a batch and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- static cumsum(sequence, sample_ratios)#
- datasets = []#
- dataset_names = []#
- sample_ratios#
- cumulative_sizes#
- real_sizes#
- __len__()#
- __getitem__(idx)#
- _get_dataset_and_sample_index_list(sample_idxs: list)#
- _get_dataset_and_sample_index(idx: int)#
- property updated_dataset_sizes#
- metadata_hasattr(attr) → bool#
- get_metadata(attr, sample_idxs_to_get_metadata_for)#
- static _dataset_sampling(dataset_sizes: list[int], dataset_names: list[str], sampling: dict) → list[float]#
Return expansion ratios for each dataset based on sampling strategy.
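The members above (sample_ratios, cumulative_sizes, real_sizes, and the index lookup helpers) suggest that a global index is resolved to a dataset and a local sample index. The following is a minimal sketch of one way that could work, assuming each dataset's effective length is its real size scaled by its expansion ratio and that oversampled datasets wrap around. The function names expanded_cumsum and locate are hypothetical and are not part of this module.

```python
import bisect


def expanded_cumsum(real_sizes, sample_ratios):
    """Running totals of the ratio-scaled dataset sizes (assumed behavior)."""
    totals, running = [], 0
    for size, ratio in zip(real_sizes, sample_ratios):
        running += int(ratio * size)
        totals.append(running)
    return totals


def locate(idx, cum_sizes, real_sizes):
    """Map a global index to (dataset_idx, sample_idx)."""
    if not 0 <= idx < cum_sizes[-1]:
        raise IndexError(idx)
    # First dataset whose cumulative size is strictly greater than idx.
    dataset_idx = bisect.bisect_right(cum_sizes, idx)
    start = cum_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
    # Wrap around so a dataset with ratio > 1 revisits its own samples.
    return dataset_idx, (idx - start) % real_sizes[dataset_idx]


real_sizes = [4, 10]   # hypothetical dataset lengths
ratios = [2.0, 0.5]    # hypothetical expansion ratios
cum = expanded_cumsum(real_sizes, ratios)
```

Under these assumptions the first dataset is visited twice per pass and only half of the second dataset's index range is exposed, so the concatenated length is cum[-1] rather than the sum of the real sizes.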
- core.datasets.mt_concat_dataset.create_concat_dataset(dataset_configs: omegaconf.DictConfig, combined_dataset_config: dict) ConcatDataset #
Make a concat dataset with all the splits for each dataset. Keys will be {dataset}.{split}
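As an illustration of the {dataset}.{split} key scheme, here is a minimal sketch using plain dictionaries in place of an omegaconf.DictConfig. The nested layout (a "splits" mapping under each dataset name), the helper name flatten_split_keys, and the example names and paths are all assumptions made for the example, not the documented config schema.

```python
def flatten_split_keys(dataset_configs):
    """Return {"<dataset>.<split>": split_config} for a nested mapping.

    Assumes the hypothetical shape {dataset_name: {"splits": {split_name: cfg}}}.
    """
    flat = {}
    for dataset_name, dataset_cfg in dataset_configs.items():
        for split_name, split_cfg in dataset_cfg["splits"].items():
            flat[f"{dataset_name}.{split_name}"] = split_cfg
    return flat


# Hypothetical dataset names and placeholder paths.
example = {
    "dataset_a": {"splits": {"train": {"src": "a_train"}, "val": {"src": "a_val"}}},
    "dataset_b": {"splits": {"train": {"src": "b_train"}}},
}
keys = list(flatten_split_keys(example))
```

Keys of this form keep every split addressable by name after concatenation, which is convenient when a sampling dictionary needs to refer to individual splits.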