OCP Data Preprocessing Tutorial
This notebook provides an overview of converting ASE Atoms objects to PyTorch Geometric Data objects. To better understand the raw data contained within OC20, check out the following tutorial first: https://github.com/Open-Catalyst-Project/ocp/blob/master/docs/source/tutorials/data_visualization.ipynb
from fairchem.core.preprocessing import AtomsToGraphs
import ase.io
from ase.build import bulk
from ase.build import fcc100, add_adsorbate, molecule
from ase.constraints import FixAtoms
from ase.calculators.emt import EMT
from ase.optimize import BFGS
Generate toy dataset: Relaxation of CO on Cu
adslab = fcc100("Cu", size=(2, 2, 3))
ads = molecule("CO")
add_adsorbate(adslab, ads, 3, offset=(1, 1))
cons = FixAtoms(indices=[atom.index for atom in adslab if (atom.tag == 3)])
adslab.set_constraint(cons)
adslab.center(vacuum=13.0, axis=2)
adslab.set_pbc(True)
adslab.calc = EMT()  # set_calculator() is deprecated in ASE
dyn = BFGS(adslab, trajectory="CuCO_adslab.traj", logfile=None)
dyn.run(fmax=0, steps=1000)
False
raw_data = ase.io.read("CuCO_adslab.traj", ":")
print(len(raw_data))
1001
Convert Atoms object to Data object
The AtomsToGraphs class takes several arguments that control how Data objects are created:
max_neigh (int): Maximum number of neighbors a given atom is allowed to have; if more are found, the furthest are discarded
radius (float): Cutoff radius (in Å) within which neighbors of each atom are searched
r_energy (bool): Write energy to Data object
r_forces (bool): Write forces to Data object
r_distances (bool): Write distances between neighbors to Data object
r_edges (bool): Write neighbor edge indices to Data object
r_fixed (bool): Write indices of fixed atoms to Data object
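To make the roles of radius and max_neigh concrete, here is a minimal pure-Python sketch of a cutoff-based neighbor search that keeps only the closest neighbors. It is an illustration under simplifying assumptions, not how AtomsToGraphs is implemented: it ignores periodic images, and `neighbor_edges` is a hypothetical helper that does not exist in the library.

```python
import math


def neighbor_edges(positions, radius, max_neigh):
    """Return (neighbor, center) index pairs for a non-periodic toy system.

    Hypothetical helper for illustration only; periodic boundary
    conditions are deliberately ignored.
    """
    edges = []
    for center, p_center in enumerate(positions):
        candidates = []
        for other, p_other in enumerate(positions):
            if other == center:
                continue
            dist = math.dist(p_center, p_other)
            if dist <= radius:
                candidates.append((dist, other))
        # Sort by distance (ties broken by index) and drop the furthest
        # candidates beyond max_neigh.
        candidates.sort()
        for _, other in candidates[:max_neigh]:
            edges.append((other, center))
    return edges


# Four atoms on a line; the last one is outside everyone's cutoff.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = neighbor_edges(positions, radius=2.5, max_neigh=1)
print(edges)
```

With max_neigh=1 each atom keeps only its single closest neighbor inside the cutoff, and the isolated atom contributes no edges. In the notebook itself, the class is configured as follows: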
a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=True,
    r_forces=True,
    r_distances=False,
    r_edges=True,
    r_fixed=True,
)
data_objects = a2g.convert_all(raw_data, disable_tqdm=True)
data = data_objects[0]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], edge_distance_vec=[635, 3], energy=3.9893144106683787, forces=[14, 3], fixed=[14])
data.atomic_numbers
tensor([29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 8, 6],
dtype=torch.uint8)
data.cell
tensor([[[ 5.1053, 0.0000, 0.0000],
[ 0.0000, 5.1053, 0.0000],
[ 0.0000, 0.0000, 32.6100]]])
data.edge_index  # row 0: neighbor index, row 1: source index
tensor([[ 1, 2, 2, ..., 5, 6, 3],
[ 0, 0, 0, ..., 13, 13, 13]])
from torch_geometric.utils import degree
# Degree corresponds to the number of neighbors a given node has. Note that no node has more than max_neigh
# neighbors.
degree(data.edge_index[1])
tensor([45., 45., 45., 46., 49., 49., 49., 49., 50., 49., 49., 49., 26., 35.])
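The degree call above counts how many times each node index appears in the second row of edge_index. A minimal pure-Python sketch of the same counting, using a small made-up edge list rather than the notebook's data (`in_degree` is a hypothetical helper, not part of torch_geometric):

```python
from collections import Counter


def in_degree(index_row, num_nodes):
    """Count how often each node index occurs in one row of an edge list."""
    counts = Counter(index_row)
    return [counts.get(node, 0) for node in range(num_nodes)]


# Toy edge list in the same layout as data.edge_index:
# row 0 holds neighbor indices, row 1 holds source indices.
edge_index = [
    [1, 2, 0, 2, 0],
    [0, 0, 1, 1, 2],
]
degrees = in_degree(edge_index[1], num_nodes=4)
print(degrees)

# A cap such as max_neigh can then be checked against the largest degree.
max_neigh = 2
print(max(degrees) <= max_neigh)
```

Node 3 never appears as a source, so its degree is zero; the real function returns a floating-point tensor rather than a list.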
data.fixed
tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.int32)
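The fixed field is a per-atom 0/1 mask: the four atoms selected by the FixAtoms constraint (tag 3, the bottom slab layer) are marked with 1. The sketch below shows how such a mask follows from a list of constrained indices. It assumes the ASE layer tags of the toy slab are 3 for the bottom layer, 2 for the middle layer, 1 for the top layer, and 0 for the adsorbate; `fixed_mask` is a hypothetical helper for illustration.

```python
def fixed_mask(num_atoms, fixed_indices):
    """Build a 0/1 list with 1 at every constrained atom index."""
    mask = [0] * num_atoms
    for index in fixed_indices:
        mask[index] = 1
    return mask


# Assumed ASE tags for the 2x2x3 Cu slab plus CO (bottom layer listed first).
tags = [3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0]

# Same selection rule as the FixAtoms constraint in the toy dataset.
fixed_indices = [i for i, tag in enumerate(tags) if tag == 3]
mask = fixed_mask(len(tags), fixed_indices)
print(mask)
```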
data.forces
tensor([[ 1.7794e-15, 2.2235e-15, 1.1354e-01],
[-8.0838e-16, 1.1120e-15, 1.1344e-01],
[ 2.1459e-15, 1.2733e-15, 1.1344e-01],
[-5.7891e-16, 8.3663e-16, 1.1294e-01],
[-8.5221e-03, -8.5221e-03, -1.1496e-02],
[ 8.5221e-03, -8.5221e-03, -1.1496e-02],
[-8.5221e-03, 8.5221e-03, -1.1496e-02],
[ 8.5221e-03, 8.5221e-03, -1.1496e-02],
[-1.6723e-15, -1.1735e-15, -1.0431e-01],
[ 3.9409e-16, -1.5543e-15, -6.6610e-02],
[-3.4001e-15, -8.4849e-17, -6.6610e-02],
[ 1.8858e-15, 9.3691e-16, -3.3250e-01],
[-4.4046e-20, -4.4046e-20, -3.4247e-01],
[ 8.7549e-18, -5.1229e-18, 5.0512e-01]])
data.pos
tensor([[ 0.0000, 0.0000, 13.0000],
[ 2.5527, 0.0000, 13.0000],
[ 0.0000, 2.5527, 13.0000],
[ 2.5527, 2.5527, 13.0000],
[ 1.2763, 1.2763, 14.8050],
[ 3.8290, 1.2763, 14.8050],
[ 1.2763, 3.8290, 14.8050],
[ 3.8290, 3.8290, 14.8050],
[ 0.0000, 0.0000, 16.6100],
[ 2.5527, 0.0000, 16.6100],
[ 0.0000, 2.5527, 16.6100],
[ 2.5527, 2.5527, 16.6100],
[ 2.5527, 2.5527, 19.6100],
[ 2.5527, 2.5527, 18.4597]])
data.energy
3.9893144106683787
Adding additional info to your Data objects
In addition to the above information, the OCP repo requires several other pieces of information for your data to work with the provided trainers:
sid (int): A unique identifier for a particular system. It does not affect model performance; it is used when saving predictions.
fid (int) (S2EF only): A unique frame identifier that distinguishes Atoms objects coming from the same system. Required when training for the S2EF task.
tags (tensor): Tag information - 0 for subsurface, 1 for surface, 2 for adsorbate. Optional; can be used for training.
Other information may be added here as well if you choose to incorporate it in your models or frameworks.
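Note that the tags produced by ASE's surface builders follow a different convention from the one listed above: slab atoms are tagged by layer (1 for the top layer, larger numbers for deeper layers) and adsorbate atoms keep tag 0. If you want the 0/1/2 convention for a toy system like this one, a remapping along the following lines may help. This is a sketch under that assumption about the input tags, and `ase_layer_tags_to_ocp` is a hypothetical helper, not a library function.

```python
def ase_layer_tags_to_ocp(ase_tags):
    """Map ASE layer tags to 0 (subsurface), 1 (surface), 2 (adsorbate).

    Assumption: tag 0 marks adsorbate atoms, tag 1 the top slab layer,
    and any larger tag a deeper slab layer.
    """
    ocp_tags = []
    for tag in ase_tags:
        if tag == 0:
            ocp_tags.append(2)  # adsorbate
        elif tag == 1:
            ocp_tags.append(1)  # surface
        else:
            ocp_tags.append(0)  # subsurface
    return ocp_tags


# Assumed ASE tags for the 2x2x3 Cu slab plus CO (bottom layer listed first).
ase_tags = [3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0]
ocp_tags = ase_layer_tags_to_ocp(ase_tags)
print(ocp_tags)
```

Whether the middle layer should count as surface or subsurface is a modeling choice; here only the top layer is treated as surface.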
data_objects = []
for idx, system in enumerate(raw_data):
    data = a2g.convert(system)
    data.fid = idx
    data.sid = 0  # All data points come from the same system, arbitrarily define this as 0
    data_objects.append(data)
data = data_objects[100]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 638], cell_offsets=[638, 3], edge_distance_vec=[638, 3], energy=3.9683558933958047, forces=[14, 3], fixed=[14], fid=100, sid=0)
data.sid
0
data.fid
100
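The toy dataset contains a single system, so every frame shares sid 0. For a dataset built from several trajectories, one possible scheme is to use the trajectory index as sid and the position within the trajectory as fid. The sketch below illustrates only that bookkeeping: strings stand in for Atoms objects, SimpleNamespace stands in for Data objects, and `label_frames` is a hypothetical helper.

```python
from types import SimpleNamespace


def label_frames(trajectories):
    """Attach a system id (sid) and frame id (fid) to every frame."""
    labeled = []
    for sid, frames in enumerate(trajectories):
        for fid, frame in enumerate(frames):
            # In the notebook this would be a Data object from a2g.convert(frame).
            labeled.append(SimpleNamespace(frame=frame, sid=sid, fid=fid))
    return labeled


# Two made-up trajectories with three and two frames.
trajectories = [["a0", "a1", "a2"], ["b0", "b1"]]
labeled = label_frames(trajectories)
ids = [(item.sid, item.fid) for item in labeled]
print(ids)
```

Each (sid, fid) pair is unique, which is what allows predictions to be matched back to individual frames.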
Resources:
https://github.com/Open-Catalyst-Project/ocp/blob/6604e7130ea41fabff93c229af2486433093e3b4/ocpmodels/preprocessing/atoms_to_graphs.py
https://github.com/Open-Catalyst-Project/ocp/blob/master/scripts/preprocess_ef.py