Making LMDB Datasets (original format, deprecated for ASE LMDBs)#
Storing your data in an LMDB gives very fast random reads and the highest throughput of the supported formats. This was the recommended option for the majority of FAIRChem use cases, but it has since been deprecated in favor of ASE LMDB files.
This notebook provides an overview of how to create LMDB datasets to be used with the FAIRChem repo. This tutorial is intended for those who wish to use FAIRChem to train on their own datasets. Those interested in just using FAIRChem data need not worry about these steps as they’ve been automated as part of this download script.
from fairchem.core.preprocessing import AtomsToGraphs
from fairchem.core.datasets import LmdbDataset
import ase.io
from ase.build import bulk
from ase.build import fcc100, add_adsorbate, molecule
from ase.constraints import FixAtoms
from ase.calculators.emt import EMT
from ase.optimize import BFGS
import matplotlib.pyplot as plt
import lmdb
import pickle
from tqdm import tqdm
import torch
import os
Generate toy dataset: Relaxation of CO on Cu#
adslab = fcc100("Cu", size=(2, 2, 3))
ads = molecule("CO")
add_adsorbate(adslab, ads, 3, offset=(1, 1))
cons = FixAtoms(indices=[atom.index for atom in adslab if (atom.tag == 3)])
adslab.set_constraint(cons)
adslab.center(vacuum=13.0, axis=2)
adslab.set_pbc(True)
adslab.calc = EMT()
dyn = BFGS(adslab, trajectory="CuCO_adslab.traj", logfile=None)
# fmax=0 cannot be met in practice, so all 1000 steps run (hence the False below)
dyn.run(fmax=0, steps=1000)
False
raw_data = ase.io.read("CuCO_adslab.traj", ":")
len(raw_data)
1001
Initial Structure to Relaxed Energy/Structure (IS2RE/IS2RS) LMDBs#
IS2RE/IS2RS LMDBs utilize the SinglePointLmdb dataset. This dataset expects the data to be contained in a SINGLE LMDB file. In addition to the attributes defined by AtomsToGraphs, the following attributes must be added for the IS2RE/IS2RS tasks:
pos_relaxed: Relaxed adslab positions
sid: Unique system identifier, arbitrary
y_init: Initial adslab energy, formerly Data.y
y_relaxed: Relaxed adslab energy
tags (optional): 0 - subsurface, 1 - surface, 2 - adsorbate
As a demo, we will use the above generated data to create an IS2R* LMDB file.
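Before writing anything, it can help to check that each object carries the attributes listed above. The following is a minimal sketch, not part of FAIRChem: the helper name `missing_is2re_attrs` is hypothetical, and a `SimpleNamespace` stands in for the graph object that AtomsToGraphs would produce.

```python
from types import SimpleNamespace

# Attribute names taken from the list above; `tags` is optional and not checked.
REQUIRED_IS2RE_ATTRS = ("pos_relaxed", "sid", "y_init", "y_relaxed")


def missing_is2re_attrs(data):
    """Return the required IS2RE/IS2RS attributes that are absent or None."""
    return [name for name in REQUIRED_IS2RE_ATTRS if getattr(data, name, None) is None]


# Stand-in for a graph object (hypothetical values, y_relaxed left out on purpose).
sample = SimpleNamespace(pos_relaxed=[[0.0, 0.0, 0.0]], sid=0, y_init=3.99)
print(missing_is2re_attrs(sample))  # ['y_relaxed']
```

An empty list means every required attribute is present and not None.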
Initialize AtomsToGraphs feature extractor#
a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=True,  # False for test data
    r_forces=True,  # False for test data
    r_distances=False,
    r_fixed=True,
)
Initialize LMDB file#
db = lmdb.open(
    "sample_CuCO.lmdb",
    map_size=1099511627776 * 2,
    subdir=False,
    meminit=False,
    map_async=True,
)
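The `map_size` literal is easier to read once broken down: 1099511627776 is 1024**4, so the database is capped at 2 TiB. `map_size` is given in bytes and sets the maximum size the database may grow to. A small sketch of the arithmetic (the constant name `TIB` is only for illustration):

```python
# LMDB's map_size is given in bytes and caps how large the database may grow.
TIB = 1024**4  # one tebibyte in bytes

map_size = 2 * TIB
print(map_size == 1099511627776 * 2)  # True
print(map_size / 1e12)  # about 2.2 when expressed in decimal terabytes
```

The cap only needs to exceed the final size of the data, so a much smaller value is likely enough for a toy dataset like this one.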
Write data to LMDB#
def read_trajectory_extract_features(a2g, traj_path):
    traj = ase.io.read(traj_path, ":")
    tags = traj[0].get_tags()
    # Keep only the initial and final (relaxed) frames
    images = [traj[0], traj[-1]]
    data_objects = a2g.convert_all(images, disable_tqdm=True)
    data_objects[0].tags = torch.LongTensor(tags)
    data_objects[1].tags = torch.LongTensor(tags)
    return data_objects
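Note that `read_trajectory_extract_features` copies the ASE tags unchanged. ASE's surface builders tag slab layers from the top down (1 for the top layer) and leave adsorbate atoms at 0, which is not the 0/1/2 convention listed earlier. If your models rely on that convention, a remapping along these lines may be needed. This is a sketch under the assumption that only the top layer counts as "surface"; the function name is hypothetical.

```python
def ase_layer_tags_to_ocp(ase_tags):
    """Map ASE slab tags to the 0/1/2 convention listed earlier.

    Assumes ASE's surface builders were used: adsorbate atoms carry tag 0,
    the top layer carries tag 1, and deeper layers carry 2, 3, ...
    """
    converted = []
    for tag in ase_tags:
        if tag == 0:
            converted.append(2)  # adsorbate
        elif tag == 1:
            converted.append(1)  # surface (top layer only, by assumption)
        else:
            converted.append(0)  # subsurface
    return converted


# Illustrative tags: three slab layers of four atoms plus a two-atom adsorbate.
# The atom ordering here is made up for the example.
example_tags = [3] * 4 + [2] * 4 + [1] * 4 + [0, 0]
print(ase_layer_tags_to_ocp(example_tags))
```

The result could then be passed to `torch.LongTensor` in place of the raw ASE tags.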
system_paths = ["CuCO_adslab.traj"]
idx = 0
for system in system_paths:
    # Extract Data objects for the initial and relaxed frames
    data_objects = read_trajectory_extract_features(a2g, system)
    initial_struc = data_objects[0]
    relaxed_struc = data_objects[1]

    # Note: the Data objects printed in this notebook expose the label as `energy`;
    # if your AtomsToGraphs version does the same, read `.energy` here instead of `.y`.
    initial_struc.y_init = initial_struc.y  # subtract off reference energy, if applicable
    del initial_struc.y
    initial_struc.y_relaxed = relaxed_struc.y  # subtract off reference energy, if applicable

    initial_struc.pos_relaxed = relaxed_struc.pos

    # Filter data if necessary
    # FAIRChem filters adsorption energies > |10| eV

    initial_struc.sid = idx  # arbitrary unique identifier

    # no neighbor edge case check
    if initial_struc.edge_index.shape[1] == 0:
        print("no neighbors", system)
        continue

    # Write to LMDB
    txn = db.begin(write=True)
    txn.put(f"{idx}".encode("ascii"), pickle.dumps(initial_struc, protocol=-1))
    txn.commit()
    db.sync()
    idx += 1

db.close()
dataset = LmdbDataset({"src": "sample_CuCO.lmdb"})
len(dataset)
1
dataset[0]
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], edge_distance_vec=[635, 3], energy=3.9893144106683787, forces=[14, 3], fixed=[14], pos_relaxed=[14, 3], sid=0)
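The comments in the loop above mention subtracting a reference energy and filtering adsorption energies beyond |10| eV. A minimal sketch of both steps follows. It assumes the usual adsorption-energy form (adslab energy minus clean-slab and gas-phase references); the function names and all numbers are illustrative, not FAIRChem APIs or data.

```python
def referenced_energy(e_adslab, e_slab, e_gas):
    """Adsorption-style energy: adslab total minus slab and gas-phase references."""
    return e_adslab - e_slab - e_gas


def keep_energy(energy, cutoff=10.0):
    """True when |energy| is within the cutoff (eV), mirroring the filter comment."""
    return abs(energy) <= cutoff


# All numbers below are made up for illustration.
e_ads = referenced_energy(e_adslab=-105.2, e_slab=-90.0, e_gas=-14.5)
print(round(e_ads, 3), keep_energy(e_ads))  # -0.7 True
print(keep_energy(12.3))  # False
```

In the loop, a system failing `keep_energy` would simply be skipped with `continue` before the write.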
Structure to Energy and Forces (S2EF) LMDBs#
S2EF LMDBs utilize the TrajectoryLmdb dataset. This dataset expects a directory of LMDB files. In addition to the attributes defined by AtomsToGraphs, the following attributes must be added for the S2EF task:
tags (optional): 0 - subsurface, 1 - surface, 2 - adsorbate
fid: Frame index along the trajectory
sid: Unique system identifier, arbitrary
Additionally, a “length” key must be added to each LMDB file.
As a demo, we will use the above generated data to create an S2EF LMDB dataset.
os.makedirs("s2ef", exist_ok=True)
db = lmdb.open(
    "s2ef/sample_CuCO.lmdb",
    map_size=1099511627776 * 2,
    subdir=False,
    meminit=False,
    map_async=True,
)
tags = raw_data[0].get_tags()
data_objects = a2g.convert_all(raw_data, disable_tqdm=True)
for fid, data in tqdm(enumerate(data_objects), total=len(data_objects)):
    # assign sid
    data.sid = torch.LongTensor([0])

    # assign fid
    data.fid = torch.LongTensor([fid])

    # assign tags, if available
    data.tags = torch.LongTensor(tags)

    # Filter data if necessary
    # FAIRChem filters adsorption energies > |10| eV and forces > |50| eV/A

    # no neighbor edge case check
    if data.edge_index.shape[1] == 0:
        print("no neighbors", fid)
        continue

    # Write one frame per key
    txn = db.begin(write=True)
    txn.put(f"{fid}".encode("ascii"), pickle.dumps(data, protocol=-1))
    txn.commit()

# Store the number of frames under the "length" key
txn = db.begin(write=True)
txn.put("length".encode("ascii"), pickle.dumps(len(data_objects), protocol=-1))
txn.commit()

db.sync()
db.close()
100%|██████████| 1001/1001 [00:01<00:00, 609.76it/s]
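The layout written by the loop above can be pictured with a plain dict standing in for the LMDB file: ASCII-encoded frame indices as keys, pickled objects as values, plus the extra "length" entry. This is only an illustration of the key scheme; `fake_db` and the frame dictionaries are made-up stand-ins, not real LMDB or FAIRChem objects.

```python
import pickle

# A plain dict stands in for the LMDB file: keys are bytes, values are pickled objects.
fake_db = {}
frames = [{"fid": 0, "energy": 3.99}, {"fid": 1, "energy": 3.98}]  # made-up stand-ins

for fid, frame in enumerate(frames):
    fake_db[f"{fid}".encode("ascii")] = pickle.dumps(frame, protocol=-1)

# The extra "length" entry records how many frames the file holds.
fake_db["length".encode("ascii")] = pickle.dumps(len(frames), protocol=-1)

num_frames = pickle.loads(fake_db[b"length"])
first = pickle.loads(fake_db[b"0"])
print(num_frames, first["fid"])  # 2 0
```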
dataset = LmdbDataset({"src": "s2ef/"})
len(dataset)
1001
dataset[0]
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], edge_distance_vec=[635, 3], energy=3.9893144106683787, forces=[14, 3], fixed=[14], sid=[1], fid=[1], id='0_0')
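The filtering comment in the S2EF loop mentions cutoffs of |10| eV on energies and |50| eV/A on forces. A possible per-frame check is sketched below with plain Python lists. It assumes the force cutoff applies to the largest absolute force component on any atom, which is an interpretation rather than something the text above specifies; the function names are hypothetical.

```python
def max_abs_force(forces):
    """Largest absolute force component over all atoms (forces: list of [fx, fy, fz])."""
    return max(abs(component) for atom in forces for component in atom)


def keep_frame(energy, forces, e_cut=10.0, f_cut=50.0):
    """True when the frame passes both the energy (eV) and force (eV/A) cutoffs."""
    return abs(energy) <= e_cut and max_abs_force(forces) <= f_cut


# Made-up forces for a two-atom frame.
calm = [[0.1, -0.2, 0.0], [0.0, 0.3, -0.1]]
spiky = [[0.1, -62.0, 0.0], [0.0, 0.3, -0.1]]
print(keep_frame(-0.7, calm), keep_frame(-0.7, spiky))  # True False
```

With tensors, the same check could be written with `forces.abs().max()`, again depending on which norm you want the cutoff to apply to.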
Advanced usage#
LmdbDataset supports multiple LMDB files because dataset construction often needs to be highly parallelized. FAIRChem’s largest split contains 135M+ frames, so parallelizing LMDB generation was necessary. If you find yourself dealing with very large datasets, we recommend parallelizing this process.
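One way to set this up is to split the list of trajectory files into shards and have each worker write its own LMDB file into a shared directory. The sketch below covers only the sharding step; `shard_paths`, the trajectory file names, and the output file-name pattern are hypothetical.

```python
def shard_paths(paths, num_shards):
    """Split paths into num_shards contiguous, near-equal chunks (some may be empty)."""
    base, extra = divmod(len(paths), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        stop = start + base + (1 if i < extra else 0)
        shards.append(paths[start:stop])
        start = stop
    return shards


# Hypothetical trajectory file names; each shard would get its own output file.
paths = [f"system_{i}.traj" for i in range(10)]
for worker_id, shard in enumerate(shard_paths(paths, 3)):
    print(f"data.{worker_id:04d}.lmdb", len(shard))
```

Each shard could then be handed to a separate process (for example via `multiprocessing.Pool`) that opens its own LMDB file and runs the same write loop shown earlier.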
Interacting with the LMDBs#
Below we demonstrate how to interact with an LMDB to extract particular information.
dataset = LmdbDataset({"src": "s2ef/"})
data = dataset[0]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], edge_distance_vec=[635, 3], energy=3.9893144106683787, forces=[14, 3], fixed=[14], sid=[1], fid=[1], id='0_0')
energies = torch.tensor([data.energy for data in dataset])
energies
tensor([3.9893, 3.9835, 3.9784, ..., 3.9684, 3.9684, 3.9684])
plt.hist(energies, bins=10)
plt.yscale("log")
plt.xlabel("Energies")
plt.show()