OCP Data Preprocessing Tutorial

This notebook provides an overview of converting ASE Atoms objects to PyTorch Geometric Data objects. To better understand the raw data contained within OC20, check out the following tutorial first: https://github.com/Open-Catalyst-Project/ocp/blob/master/docs/source/tutorials/data_visualization.ipynb

from fairchem.core.preprocessing import AtomsToGraphs
import ase.io
from ase.build import bulk
from ase.build import fcc100, add_adsorbate, molecule
from ase.constraints import FixAtoms
from ase.calculators.emt import EMT
from ase.optimize import BFGS

Generate toy dataset: Relaxation of CO on Cu

adslab = fcc100("Cu", size=(2, 2, 3))
ads = molecule("CO")
add_adsorbate(adslab, ads, 3, offset=(1, 1))
cons = FixAtoms(indices=[atom.index for atom in adslab if (atom.tag == 3)])
adslab.set_constraint(cons)
adslab.center(vacuum=13.0, axis=2)
adslab.set_pbc(True)
adslab.calc = EMT()
dyn = BFGS(adslab, trajectory="CuCO_adslab.traj", logfile=None)
# fmax=0 can never be reached, so the optimizer runs all 1000 steps and reports no convergence
dyn.run(fmax=0, steps=1000)
False
raw_data = ase.io.read("CuCO_adslab.traj", ":")
print(len(raw_data))
1001

Convert Atoms object to Data object

The AtomsToGraphs class takes several arguments that control how Data objects are created:

  • max_neigh (int): Maximum number of neighbors an atom may have; if more fall within the cutoff, the furthest are discarded

  • radius (float): Cutoff radius within which neighbors are searched around each atom

  • r_energy (bool): Write energy to Data object

  • r_forces (bool): Write forces to Data object

  • r_distances (bool): Write distances between neighbors to Data object

  • r_edges (bool): Write neighbor edge indices to Data object

  • r_fixed (bool): Write indices of fixed atoms to Data object
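To make the interplay of radius and max_neigh concrete, here is a minimal pure-Python sketch of the selection rule described above. It is a toy illustration and not the AtomsToGraphs implementation: the helper name `select_neighbors` is hypothetical, periodic images are ignored, and treating the cutoff as inclusive is an assumption.

```python
import math


def select_neighbors(positions, radius, max_neigh):
    """Toy neighbor search (hypothetical helper, no periodic boundaries).

    For each atom, collect all other atoms within `radius`, then keep only
    the `max_neigh` closest ones, discarding the furthest.
    Returns a list of (neighbor index, center index) pairs.
    """
    edges = []
    for i, pos_i in enumerate(positions):
        candidates = []
        for j, pos_j in enumerate(positions):
            if i == j:
                continue
            dist = math.dist(pos_i, pos_j)
            if dist <= radius:  # inclusive cutoff is an assumption of this sketch
                candidates.append((dist, j))
        candidates.sort()  # closest first
        for _, j in candidates[:max_neigh]:
            edges.append((j, i))
    return edges


# Four atoms on a line; the last one is outside every cutoff sphere.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = select_neighbors(positions, radius=2.5, max_neigh=1)
print(edges)
```

With max_neigh=1 each atom keeps only its single closest neighbor, and the isolated atom contributes no edges at all.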

a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=True,
    r_forces=True,
    r_distances=False,
    r_edges=True,
    r_fixed=True,
)
data_objects = a2g.convert_all(raw_data, disable_tqdm=True)
data = data_objects[0]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 635], cell_offsets=[635, 3], edge_distance_vec=[635, 3], energy=3.9893144106683787, forces=[14, 3], fixed=[14])
data.atomic_numbers
tensor([29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 29., 29.,  8.,  6.])
data.cell
tensor([[[ 5.1053,  0.0000,  0.0000],
         [ 0.0000,  5.1053,  0.0000],
         [ 0.0000,  0.0000, 32.6100]]])
data.edge_index  # row 0: neighbor idx, row 1: source idx
tensor([[ 1,  2,  2,  ...,  5,  6,  3],
        [ 0,  0,  0,  ..., 13, 13, 13]])
from torch_geometric.utils import degree
# Degree corresponds to the number of neighbors a given node has. Note that no node has more than
# max_neigh neighbors.

degree(data.edge_index[1]) 
tensor([45., 45., 45., 46., 49., 49., 49., 49., 50., 49., 49., 49., 26., 35.])
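The degree computation above amounts to counting how often each node index appears in one row of edge_index. A minimal pure-Python sketch of that counting follows; the helper name `node_degree` and the small edge list are made up for illustration and do not come from the toy dataset.

```python
from collections import Counter


def node_degree(index_row, num_nodes):
    """Count how many edges reference each node in a single row of edge_index."""
    counts = Counter(index_row)
    return [counts.get(i, 0) for i in range(num_nodes)]


# Illustrative edge list in the same layout as data.edge_index.
edge_index = [
    [1, 2, 0, 2, 0],  # neighbor idx
    [0, 0, 1, 1, 2],  # source idx
]
deg = node_degree(edge_index[1], num_nodes=3)
print(deg)  # [2, 2, 1]
```

A node ends up with fewer than max_neigh edges whenever fewer atoms than that lie within the cutoff radius.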
data.fixed
tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.int32)
data.forces
tensor([[ 1.7794e-15,  2.2235e-15,  1.1354e-01],
        [-8.0838e-16,  1.1120e-15,  1.1344e-01],
        [ 2.1459e-15,  1.2733e-15,  1.1344e-01],
        [-5.7891e-16,  8.3663e-16,  1.1294e-01],
        [-8.5221e-03, -8.5221e-03, -1.1496e-02],
        [ 8.5221e-03, -8.5221e-03, -1.1496e-02],
        [-8.5221e-03,  8.5221e-03, -1.1496e-02],
        [ 8.5221e-03,  8.5221e-03, -1.1496e-02],
        [-1.6723e-15, -1.1735e-15, -1.0431e-01],
        [ 3.9409e-16, -1.5543e-15, -6.6610e-02],
        [-3.4001e-15, -8.4849e-17, -6.6610e-02],
        [ 1.8858e-15,  9.3691e-16, -3.3250e-01],
        [-4.4046e-20, -4.4046e-20, -3.4247e-01],
        [ 8.7549e-18, -5.1229e-18,  5.0512e-01]])
data.pos
tensor([[ 0.0000,  0.0000, 13.0000],
        [ 2.5527,  0.0000, 13.0000],
        [ 0.0000,  2.5527, 13.0000],
        [ 2.5527,  2.5527, 13.0000],
        [ 1.2763,  1.2763, 14.8050],
        [ 3.8290,  1.2763, 14.8050],
        [ 1.2763,  3.8290, 14.8050],
        [ 3.8290,  3.8290, 14.8050],
        [ 0.0000,  0.0000, 16.6100],
        [ 2.5527,  0.0000, 16.6100],
        [ 0.0000,  2.5527, 16.6100],
        [ 2.5527,  2.5527, 16.6100],
        [ 2.5527,  2.5527, 19.6100],
        [ 2.5527,  2.5527, 18.4597]])
data.energy
3.9893144106683787

Adding additional info to your Data objects

In addition to the above information, the OCP repo requires several other pieces of information for your data to work with the provided trainers:

  • sid (int): A unique identifier for a particular system. It does not affect model performance; it is used when saving predictions.

  • fid (int) (S2EF only): If training for the S2EF task, your data must also contain a unique frame identifier for atoms objects coming from the same system.

  • tags (tensor): Tag information - 0 for subsurface, 1 for surface, 2 for adsorbate. Optional, can be used for training.

Other information may be added here as well if you choose to incorporate it in your models/frameworks.
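The tag convention listed above differs from the tags that ASE's surface builders assign, where, as far as we are aware, slab layers are numbered from 1 at the top and adsorbate atoms keep the default tag 0. If your tags come straight from such a builder, a remapping along the following lines may be needed. This is a hedged sketch: `ase_to_ocp_tags` is a hypothetical helper, and treating only the top layer as "surface" is an assumption you may want to change.

```python
def ase_to_ocp_tags(ase_tags, n_surface_layers=1):
    """Map ASE builder-style tags to the 0/1/2 convention (hypothetical helper).

    Assumed input convention: 0 = adsorbate, 1 = top slab layer, 2, 3, ... = deeper layers.
    Output convention: 0 = subsurface, 1 = surface, 2 = adsorbate.
    """
    ocp_tags = []
    for tag in ase_tags:
        if tag == 0:
            ocp_tags.append(2)  # adsorbate
        elif tag <= n_surface_layers:
            ocp_tags.append(1)  # surface
        else:
            ocp_tags.append(0)  # subsurface
    return ocp_tags


# Tags as they might look for a 3-layer, 2x2 slab with a two-atom adsorbate.
ase_tags = [3] * 4 + [2] * 4 + [1] * 4 + [0, 0]
ocp_tags = ase_to_ocp_tags(ase_tags)
print(ocp_tags)
```

The resulting list could then be converted to a tensor and assigned to the Data object's tags attribute, in the same way sid and fid are attached.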

data_objects = []
for idx, system in enumerate(raw_data):
    data = a2g.convert(system)
    data.fid = idx
    data.sid = 0  # All data points come from the same system, arbitrarily define this as 0
    data_objects.append(data)
data = data_objects[100]
data
Data(pos=[14, 3], cell=[1, 3, 3], atomic_numbers=[14], natoms=14, tags=[14], edge_index=[2, 638], cell_offsets=[638, 3], edge_distance_vec=[638, 3], energy=3.9683558933958047, forces=[14, 3], fixed=[14], fid=100, sid=0)
data.sid
0
data.fid
100
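The example above covers a single system, so sid is constant. When several relaxations are preprocessed together, one simple scheme is to number systems with sid and frames within each system with fid. The sketch below illustrates only the bookkeeping: `assign_ids` is a hypothetical helper, and the strings stand in for Atoms objects or converted Data objects.

```python
def assign_ids(trajectories):
    """Pair every frame with a (sid, fid) identifier (hypothetical helper).

    `trajectories` is a list with one list of frames per system.
    sid numbers the systems; fid numbers the frames within a system.
    """
    records = []
    for sid, frames in enumerate(trajectories):
        for fid, frame in enumerate(frames):
            records.append({"sid": sid, "fid": fid, "frame": frame})
    return records


# Placeholder frames for two systems with three and two frames.
trajectories = [["a0", "a1", "a2"], ["b0", "b1"]]
records = assign_ids(trajectories)
ids = [(r["sid"], r["fid"]) for r in records]
print(ids)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

Each (sid, fid) pair is unique across the dataset, which is what allows a saved prediction to be traced back to its frame.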

Resources:

  • https://github.com/Open-Catalyst-Project/ocp/blob/6604e7130ea41fabff93c229af2486433093e3b4/ocpmodels/preprocessing/atoms_to_graphs.py

  • https://github.com/Open-Catalyst-Project/ocp/blob/master/scripts/preprocess_ef.py