Open Catalyst 2020 (OC20)#
NOTE: Data files for all tasks / splits were updated on Feb 10, 2021 due to minor bugs (affecting < 1% of the data) in earlier versions. If you downloaded data before Feb 10, 2021, please re-download the data.
Download and preprocess the dataset#
IS2* datasets are stored as LMDB files and are ready to be used upon download. S2EF train+val datasets require an additional preprocessing step.
For convenience, the self-contained script scripts/download_data.py downloads, preprocesses, and organizes the data directories so that they are readily usable by the existing configs.
For IS2*, run the script as:
python scripts/download_data.py --task is2re
For S2EF train/val, run the script as:
python scripts/download_data.py --task s2ef --split SPLIT_SIZE --get-edges --num-workers WORKERS --ref-energy
--split: split size to download: "200k", "2M", "20M", "all", "val_id", "val_ood_ads", "val_ood_cat", or "val_ood_both".
--get-edges: includes edge information in LMDBs (~10x storage requirement, ~3-5x slowdown); otherwise, edges are computed on the fly (larger GPU memory requirement).
--num-workers: number of workers to parallelize preprocessing across.
--ref-energy: uses referenced energies instead of raw energies.
For S2EF test, run the script as:
python scripts/download_data.py --task s2ef --split test
To download and process the dataset in a directory other than your local fairchem/data folder, add the command line argument --data-path.
Note that the baseline configs expect the data to be found in fairchem/data, so make sure you symlink your directory or modify the paths in the configs accordingly.
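The symlink option mentioned above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper name `link_data_dir` is hypothetical, and it assumes that `fairchem/data` is resolved relative to the directory you run it from.

```python
from pathlib import Path


def link_data_dir(actual_dir, expected_dir="fairchem/data"):
    """Create a symlink at expected_dir that points to actual_dir.

    Hypothetical helper: expected_dir defaults to the location the
    baseline configs are said to expect, relative to the current directory.
    """
    actual = Path(actual_dir).resolve()
    expected = Path(expected_dir)
    if not actual.is_dir():
        raise NotADirectoryError(f"{actual} is not a directory")
    # Refuse to overwrite an existing folder or link.
    if expected.is_symlink() or expected.exists():
        raise FileExistsError(f"{expected} already exists; remove or rename it first")
    expected.parent.mkdir(parents=True, exist_ok=True)
    expected.symlink_to(actual, target_is_directory=True)
    return expected
```

For example, `link_data_dir("/scratch/oc20")` (an illustrative path) would make data stored under `/scratch/oc20` visible at `fairchem/data`.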
The following sections list dataset download links and sizes for various S2EF and IS2RE/IS2RS task splits. If you used the above download_data.py script to download and preprocess the data, you are good to go and can stop reading here!
Structure to Energy and Forces (S2EF) task#
For this task’s train and validation sets, we provide compressed trajectory files with the input structures and output energies and forces. We provide precomputed LMDBs for the test sets. To use the train and validation datasets, first download the files and uncompress them. The uncompressed files are used to generate LMDBs, which are in turn used by the dataloaders to train the ML models. Code for the dataloaders and for generating the LMDBs may be found in the GitHub repository.
Four training datasets of different sizes are provided. Each is a subset of the larger ones, i.e., the 2M dataset is contained in the 20M and all datasets.
Four datasets are provided for the validation set. Each corresponds to a subsplit used to evaluate a different type of extrapolation: in domain (id, same distribution as the training dataset), out of domain adsorbate (ood_ads, unseen adsorbate), out of domain catalyst (ood_cat, unseen catalyst composition), and out of domain both (ood_both, unseen adsorbate and catalyst composition).
For the test sets, we provide precomputed LMDBs for each of the 4 subsplits (In Domain, OOD Adsorbate, OOD Catalyst, OOD Both).
Each tarball has a README file containing details about file formats, number of structures / trajectories, etc.
| Splits | Compressed size | Uncompressed size | MD5 checksum (download link) |
|---|---|---|---|
| Train | | | |
| all | 225G | 1.1T | |
| 20M | 34G | 165G | |
| 2M | 3.4G | 17G | |
| 200K | 344M | 1.7G | |
| Validation | | | |
| val_id | 1.7G | 8.3G | |
| val_ood_ads | 1.7G | 8.2G | |
| val_ood_cat | 1.7G | 8.3G | |
| val_ood_both | 1.9G | 9.5G | |
| Test (LMDBs for all splits) | 30G | 415G | |
| Rattled data | 29G | 136G | |
| MD data | 42G | 306G | |
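Each download is listed with an MD5 checksum, so it can be worth confirming that a tarball arrived intact before spending time uncompressing it. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of the repository.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large tarballs.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, for the per-adsorbate `0.tar` file listed later on this page, `verify_download("0.tar", "d4151542856b4b6405f276808f75358a")` should return `True` if the download is complete.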
Initial Structure to Relaxed Structure (IS2RS) and Initial Structure to Relaxed Energy (IS2RE) tasks#
For the IS2RS and IS2RE tasks, we are providing:
One .tar.gz file with precomputed LMDBs which, once downloaded and uncompressed, can be used directly to train ML models. The LMDBs contain the input initial structures and the output relaxed structures and energies. Training datasets are split by size, with each being a subset of the larger splits, similar to S2EF. The validation and test datasets are broken into subsplits based on different extrapolation evaluations (In Domain, OOD Adsorbate, OOD Catalyst, OOD Both).
Underlying ASE relaxation trajectories for the adsorbate+catalyst in the entire training and validation sets for the IS2RE and IS2RS tasks. These are not required to download for training ML models, but are available for interested users.
Each tarball has a README file containing details about file formats, number of structures / trajectories, etc.
| Splits | Compressed size | Uncompressed size | MD5 checksum (download link) |
|---|---|---|---|
| Train (all splits) + Validation (all splits) + test (all splits) | 8.1G | 97G | |
| Test-challenge 2021 (challenge details) | 1.3G | 17G | |
Relaxation Trajectories#
Adsorbate+catalyst system trajectories (optional download)#
| Split | Compressed size | Uncompressed size | MD5 checksum (download link) |
|---|---|---|---|
| All IS2RE/S training (~466k trajectories) | 109G | 841G | |
| IS2RE/S Validation | | | |
| val_id (~25K trajectories) | 5.9G | 46G | |
| val_ood_ads (~25K trajectories) | 5.7G | 44G | |
| val_ood_cat (~25K trajectories) | 6.0G | 46G | |
| val_ood_both (~25K trajectories) | 4.4G | 35G | |
Per-adsorbate trajectories (optional download)#
Download links are in the table below:
| Adsorbate symbol | Downloadable path | Size | MD5 checksum |
|---|---|---|---|
| *O | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/0.tar | 1006M | d4151542856b4b6405f276808f75358a |
| *H | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/1.tar | 850M | 3697f04faf04251a23da8b88a78209f7 |
| *OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/2.tar | 1.6G | a21081f3f55eb0c98a91021bbe3dac44 |
| *OH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/3.tar | 1.8G | b12b706854f5d899e02a9ae6578b5d45 |
| *C | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/4.tar | 1.1G | e4fe9890764fcf59e01e3ceab089b978 |
| *CH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/6.tar | 1.4G | ec9aa2c4c4bd4419359438ba7fbb881d |
| *CHO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/7.tar | 1.4G | d32200f74ad5c3bfd42e8835f36d57ab |
| *COH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/8.tar | 1.6G | 5418a1b331f6c7689a5405cca4cc8d15 |
| *CH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/9.tar | 1.6G | 8ee1066149c305d7c17c219b369c5a73 |
| CH2O | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/10.tar | 1.7G | 960c2450814024b66f3c79121179ac60 |
| *CHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/11.tar | 1.8G | 60ac9f965f9589a3389483e3d1e58144 |
| *CH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/12.tar | 1.7G | 7e123e6f4fb10d6897be3f47721dfd4a |
| *OCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/13.tar | 1.8G | 0823047bbbe05fa0e63f9d83ec601487 |
| *CH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/14.tar | 1.9G | 9ac71e198d75b1427182cd34abb73e4d |
| *CH4 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/15.tar | 1.9G | a405ce403018bf8afbd4425d5c0b34d5 |
| *OHCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/16.tar | 2.1G | d3c829f1952db6e4f428273ee05f59b1 |
| CC | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/17.tar | 1.5G | d687a151345305897b9245af4b0f9967 |
| *CCO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/18.tar | 1.7G | 214ca96e620c5ec6e8a6ff8144a22a04 |
| *CCH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/19.tar | 1.6G | da2268545e80ca1664026449dd2fdd24 |
| *CHCO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/20.tar | 1.7G | 386c99407fe63080d26cda525dfdd8cd |
| *CCHO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/21.tar | 1.8G | 918b20960438494ab160a9dbd9668157 |
| *COCHO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/22.tar | 1.8G | 84424aa2ad30301e23ece1438ea39923 |
| *CCHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/23.tar | 2.0G | 3cc90425ec042a70085ba7eb2916a79a |
| *CCH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/24.tar | 1.8G | 9dbcf7566e40965dd7f8a186a75a718e |
| CHCH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/25.tar | 1.7G | a193b4c72f915ba0b21a41790696b23c |
| CH2*CO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/26.tar | 1.8G | de83cf50247f5556fa4f9f64beff1eeb |
| *CHCHO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/27.tar | 1.9G | 1d140aaa2e7b287124ab38911a711d70 |
| CHCOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/28.tar | 1.3G | 682d8a6b05ca5948b34dc5e5f6bbcd61 |
| *COCH2O | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/29.tar | 1.9G | c8742faa8ca40e8edb4110069817fa70 |
| CHOCHO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/30.tar | 2.0G | 8cfbb67beb312b98c40fcb891dfa480a |
| *COHCHO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/31.tar | 1.9G | 6ffa903a62d8ec3319ecec6a03b06276 |
| *COHCOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/32.tar | 2.0G | caca0058b641bfdc9f8de4527e60feb7 |
| *CCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/33.tar | 1.8G | 906543aaefc171edab388ff4f0fe8a20 |
| *CHCH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/34.tar | 1.8G | 4dfab479495f76179749c1956046fbd8 |
| *COCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/35.tar | 1.9G | 29d1b992715054e920e8bb2afe97b393 |
| *CHCHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/38.tar | 2.0G | 9e5912df6f7b11706d1046cdb9e3087e |
| *CCH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/39.tar | 2.1G | 7bcae43cee451306e34ec416588a7f09 |
| *CHOCHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/40.tar | 2.0G | f98866d08fe3451ae7ebc47bb51599aa |
| *COCH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/41.tar | 1.4G | bfaf689e5827fcf26c51e567bb8dd1be |
| *COHCHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/42.tar | 2.0G | 236fe4e950aa2fbdde94ef2821fb48d2 |
| *OCHCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/44.tar | 2.1G | 66acc5460a999625c3364f0f3bcca871 |
| *COHCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/45.tar | 2.1G | bb4a01956736399c8cee5e219f8c1229 |
| *CHOHCH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/46.tar | 2.1G | e836de4ec146b1b611533f1ef682cace |
| *CHCH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/47.tar | 2.0G | 66df44121806debef6dc038df7115d1d |
| *OCH2CHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/48.tar | 2.2G | ff6981fdbcd2e65d351505c15d218d76 |
| *CHOCH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/49.tar | 2.1G | 448f7d352ab6e32f754e24de64ca302a |
| *COHCH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/50.tar | 2.1G | 8bff6bf3e10cc84acc4a283a375fcc23 |
| *CHOHCHOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/51.tar | 2.0G | 9c9e4d617d306751760a80f1453e71f1 |
| *CH2CH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/52.tar | 2.0G | ec1e964d2ee6f468fa5773743e3994a4 |
| *OCH2CH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/53.tar | 2.1G | d297b27b02822f9b6af80bdb64aee819 |
| *CHOHCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/54.tar | 2.1G | 368de083dafdc3bbdb560d35e2a102c0 |
| *CH2CH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/55.tar | 2.1G | 3c1aaf790659f7ff89bf1eed8b396b63 |
| *CHOHCH2OH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/56.tar | 2.2G | 2d71adb9e305e6f3bca49e5df9b5a86a |
| *OHCH2CH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/57.tar | 2.3G | cf51128f8522b7b66fc68d79980d6def |
| *NH2N(CH3)2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/58.tar | 1.6G | 36ba974d80c20ff636431f7c0ad225da |
| *ONN(CH3)2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/59.tar | 2.3G | fdc4cd19977496909d61be4aee61c4f1 |
| *OHNNCH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/60.tar | 2.1G | 50a6ff098f9ba7adbba9ac115726cc5a |
| *ONH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/62.tar | 1.8G | 47573199c545afe46c554ff756c3e38f |
| *NHNH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/63.tar | 1.7G | dd456b7e19ef592d9f0308d911b91d7c |
| NNH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/65.tar | 1.6G | c05289fd56d64c74306ebf57f1061318 |
| *NO2NO2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/67.tar | 2.1G | 4822a06f6c5f41bdefd3cbbd8856c11f |
| NNO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/68.tar | 1.6G | 2a27de122d32917cc5b6ac0a21c63c1c |
| *N2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/69.tar | 1.5G | cc668fecf679b6edaac8fd8fb9cdd404 |
| *ONNH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/70.tar | 2.1G | dff880f1a5baa7f67b52fd3ed745443d |
| *NH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/71.tar | 1.6G | c7f383b50faa6244e265c9611466cb8f |
| *NH3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/72.tar | 1.9G | 2b355741f9300445703270e0e4b8c01c |
| *NONH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/73.tar | 1.8G | 48877a0c6f2994baac82cb722711aaa2 |
| *NH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/74.tar | 1.4G | 7979b9e7ab557d6979b33e352486f0ef |
| *NO2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/75.tar | 1.7G | 9f352fbc32bb2b8caf4788aba28b2eb7 |
| *NO | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/76.tar | 1.4G | 482ee306a5ae2eee78cac40d10059ebc |
| *N | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/77.tar | 1.1G | bfb6e03d4a687987ff68976f0793cc46 |
| *NO3 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/78.tar | 1.8G | 700834326e789a6e38bf3922d9fcb792 |
| *OHNH2 | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/79.tar | 2.1G | fa24472e0c02c34d91f3ffe6b77bfb11 |
| *ONOH | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/80.tar | 1.4G | 4ddcccd62a834a76fe6167461f512529 |
| *CN | https://dl.fbaipublicfiles.com/opencatalystproject/data/per_adsorbate_is2res/81.tar | 1.5G | bc7c55330ece006d09496a5ff01d5d50 |
Note - A few adsorbates are intentionally left out for the test splits.
Downloading any of the above and extracting will result in a folder:
<index>/
  system.txt: Text file containing information about the different adsorbate+catalyst system names. In total there are N systems. More details are described below.
  <index>/: This contains N compressed trajectory files of the format .extxyz.xz. Files are named <system_id>.extxyz.xz (where system_id is defined below).
where <index> can be 0 to 81. N depends on which adsorbate index is chosen.
The file system.txt has information in the following format:
system_id,reference_energy
where:
system_id - Internal random ID corresponding to an adsorbate+catalyst system.
reference_energy - Energy used to reference system energies to bare catalyst+gas reference energies. Used for adsorption energy calculations.
The .extxyz.xz files are LZMA compressed .extxyz trajectory files. Each trajectory corresponds to a relaxation trajectory of a different adsorbate+catalyst system. See https://wiki.fysik.dtu.dk/ase/dev/ase/io/formatoptions.html#extxyz for information about the .extxyz trajectory file format.
To uncompress the files, uncompress.py provides a multi-core implementation.
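As a single-process alternative for inspecting a handful of systems, the layout described above can be handled with the Python standard library alone. The sketch below is illustrative and is not the repository's uncompress.py; the function names are hypothetical, and it assumes system.txt contains only comma-separated `system_id,reference_energy` lines (any line that does not parse, such as a header, is skipped).

```python
import csv
import lzma
import shutil
from pathlib import Path


def read_reference_energies(system_txt):
    """Parse system.txt into a {system_id: reference_energy} dictionary."""
    energies = {}
    with open(system_txt, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            system_id, reference_energy = row
            try:
                energies[system_id.strip()] = float(reference_energy)
            except ValueError:
                continue  # skip lines whose second field is not a number
    return energies


def decompress_trajectory(xz_path, out_dir):
    """Decompress one <system_id>.extxyz.xz file into out_dir."""
    xz_path = Path(xz_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop the trailing ".xz" so the output is <system_id>.extxyz.
    out_path = out_dir / xz_path.name[: -len(".xz")]
    with lzma.open(xz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Assuming the usual convention that a referenced energy is the raw system energy minus its `reference_energy`, the dictionary returned by `read_reference_energies` can be used to convert raw trajectory energies into adsorption energies.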
Catalyst system trajectories (optional download)#
| Number | Compressed size | Uncompressed size | MD5 checksum (download link) |
|---|---|---|---|
| 294k systems | 20G | 151G | |
Bader charge data#
We provide Bader charge data for all final frames of our train + validation systems in OC20 (for both S2EF and IS2RE/RS tasks). A .tar.gz file, when downloaded and uncompressed, will contain several directories with unique system-ids (of the format random<XYZ> where XYZ is an integer). Each directory will contain raw Bader charge analysis outputs. For more details on the Bader charge analysis, see https://theory.cm.utexas.edu/henkelman/research/bader/.
Downloadable link: https://dl.fbaipublicfiles.com/opencatalystproject/data/oc20_bader_data.tar (MD5 checksum: aecc5e23542de49beceb4b7e44c153b9)
OC20 mappings#
Data mapping information#
We provide a Python pickle file containing information about the slab and adsorbates for each of the systems in the OC20 dataset. Loading the pickle file will load a Python dictionary. The keys of this dictionary are the adsorbate+catalyst system-ids (of the format random<XYZ> where XYZ is an integer), and the corresponding value of each key is a dictionary with information about:
bulk_mpid: Materials Project ID of the bulk system corresponding to the catalyst surface
bulk_symbols: chemical composition of the bulk counterpart
ads_symbols: chemical composition of the adsorbate counterpart
ads_id: internal unique identifier, one for each of the 82 adsorbates used in the dataset
bulk_id: internal unique identifier, one for each of the 11500 bulks used in the dataset
miller_index: 3-tuple of integers indicating the Miller indices of the surface
shift: c-direction shift used to determine cutoff for the surface (c-direction is following the nomenclature from Pymatgen)
top: boolean indicating whether the chosen surface was at the top or bottom of the originally enumerated surface
adsorption_site: a tuple of 3-tuples containing the Cartesian coordinates of each binding adsorbate atom
class: integer indicating the class of materials the system’s slab is part of, where:
  0 - intermetallics
  1 - metalloids
  2 - non-metals
  3 - halides
anomaly: integer indicating possible anomalies (based on general heuristics, not to be taken as perfect classifications), where:
  0 - no anomaly
  1 - adsorbate dissociation
  2 - adsorbate desorption
  3 - surface reconstruction
  4 - incorrect CHCOH placement, appears to be CHCO with a lone, uninteracting, H far off in the unit cell
Downloadable link: https://dl.fbaipublicfiles.com/opencatalystproject/data/oc20_data_mapping.pkl (MD5 checksum: 01c879067a05b4288055a1fdf821e068)
An example entry is
'random2181546': {'bulk_id': 6510,
'ads_id': 69,
'bulk_mpid': 'mp-22179',
'bulk_symbols': 'Si2Ti2Y2',
'ads_symbols': '*N2',
'miller_index': (2, 0, 1),
'shift': 0.145,
'top': True,
'adsorption_site': ((4.5, 12.85, 16.13),),
'class': 1,
'anomaly': 0}
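A typical use of this mapping is to select systems without flagged anomalies, or to summarise the dataset by material class. The following is a minimal sketch under the dictionary layout described above; the helper names are illustrative and not part of the repository. As with any pickle, only load a file obtained from a source you trust.

```python
import pickle
from collections import Counter

# Class codes as listed in the field descriptions.
CLASS_NAMES = {
    0: "intermetallics",
    1: "metalloids",
    2: "non-metals",
    3: "halides",
}


def load_mapping(path):
    """Load a mapping pickle, e.g. a local copy of oc20_data_mapping.pkl."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def clean_system_ids(mapping, ads_symbols=None):
    """Return system-ids with anomaly == 0, optionally for a single adsorbate."""
    return [
        system_id
        for system_id, info in mapping.items()
        if info["anomaly"] == 0
        and (ads_symbols is None or info["ads_symbols"] == ads_symbols)
    ]


def count_by_class(mapping):
    """Count systems per material class name."""
    return Counter(
        CLASS_NAMES.get(info["class"], "unknown") for info in mapping.values()
    )
```

With the example entry above, `clean_system_ids(mapping, "*N2")` would include `'random2181546'`, since its `anomaly` value is 0.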
Adsorbate-catalyst system to catalyst system mapping information#
We provide a Python pickle file containing information about the mapping from adsorbate-catalyst systems to their corresponding catalyst systems. Loading the pickle file will load a Python dictionary. The keys of this dictionary are the adsorbate+catalyst system-ids (of the format random<XYZ> where XYZ is an integer), and the values are the catalyst system-ids (of the format random<PQR> where PQR is an integer).
Downloadable link: https://dl.fbaipublicfiles.com/opencatalystproject/data/mapping_adslab_slab.pkl (MD5 checksum: 079041076c3f15d18ecb5d17c509cdfe)
An example entry is
'random1981709': 'random533137'
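Because the dictionary maps each adsorbate+catalyst system to one catalyst system, it is sometimes useful to invert it and list every adsorbate+catalyst system that shares a slab. A small sketch, assuming the dictionary has already been loaded with `pickle.load`; the function name is illustrative.

```python
from collections import defaultdict


def group_by_slab(adslab_to_slab):
    """Invert the mapping: {slab system-id: [adsorbate+catalyst system-ids]}."""
    groups = defaultdict(list)
    for adslab_id, slab_id in adslab_to_slab.items():
        groups[slab_id].append(adslab_id)
    return dict(groups)
```

For the example entry above, `group_by_slab(mapping)["random533137"]` would contain `'random1981709'` along with any other systems built on the same slab.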
Dataset changelog#
September 2021#
Released IS2RE test-challenge data for the Open Catalyst Challenge 2021.
March 2021#
Modified the pickle corresponding to data mapping information. The pickle now includes extra information about miller_index, shift, top and adsorption_site.
Added Molecular Dynamics (MD) and rattled data for the S2EF task.
Version 2, Feb 2021#
Modifications:
Removed slab systems which had single frame checkpoints. This led to modifications of reference frame energies of 350k frames out of 130M.
Fixed stitching of checkpoints in adsorbate+catalyst trajectories.
Added release of slab trajectories.
Below are the actual updated numbers, in the form old → new:
Total S2EF frames:
train: 133953162 → 133934018
validation:
val_id : 1000000 → 999866
val_ood_ads: 1000000 → 999838
val_ood_cat: 1000000 → 999809
val_ood_both: 1000000 → 999944
test:
test_id: 1000000 → 999736
test_ood_ads: 1000000 → 999859
test_ood_cat: 1000000 → 999826
test_ood_both: 1000000 → 999973
Total IS2RE and IS2RS systems:
train: 461313 → 460328
validation:
val_id : 24946 → 24943
val_ood_ads: 24966 → 24961
val_ood_cat: 24988 → 24963
val_ood_both: 24963 → 24987
test:
test_id: 24951 → 24948
test_ood_ads: 24931 → 24930
test_ood_cat: 24967 → 24965
test_ood_both: 24986 → 24985
Version 1, Oct 2020#
Total S2EF frames:
train: 133953162
validation:
val_id : 1000000
val_ood_ads: 1000000
val_ood_cat: 1000000
val_ood_both: 1000000
test:
test_id: 1000000
test_ood_ads: 1000000
test_ood_cat: 1000000
test_ood_both: 1000000
Total IS2RE and IS2RS systems:
train: 461313
validation:
val_id : 24936
val_ood_ads: 24966
val_ood_cat: 24988
val_ood_both: 24963
test:
test_id: 24951
test_ood_ads: 24931
test_ood_cat: 24967
test_ood_both: 24986
Citing OC20#
The Open Catalyst 2020 (OC20) dataset is licensed under a Creative Commons Attribution 4.0 License.
Please consider citing the following paper in any research manuscript using the OC20 dataset:
@article{ocp_dataset,
author = {Chanussot*, Lowik and Das*, Abhishek and Goyal*, Siddharth and Lavril*, Thibaut and Shuaibi*, Muhammed and Riviere, Morgane and Tran, Kevin and Heras-Domingo, Javier and Ho, Caleb and Hu, Weihua and Palizhati, Aini and Sriram, Anuroop and Wood, Brandon and Yoon, Junwoong and Parikh, Devi and Zitnick, C. Lawrence and Ulissi, Zachary},
title = {Open Catalyst 2020 (OC20) Dataset and Community Challenges},
journal = {ACS Catalysis},
year = {2021},
doi = {10.1021/acscatal.0c04525},
}