Datasets in FAIRChem
fairchem provides training and evaluation code for tasks and models that take arbitrary chemical structures as input and predict energies, forces, positions, and stresses. It can be used as a base scaffold for research projects. For an overview of tasks, data, and metrics, see the documentation and the respective papers.
Custom ASE Databases
If your data is already in an ASE Database, no additional preprocessing is necessary before running training/prediction!
Performance Tip
If your dataset is small enough to fit in CPU memory, use the keep_in_memory: True option to avoid I/O bottlenecks and significantly speed up training.
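If your data is not yet in an ASE database, the following minimal sketch shows one way to create one with energies and forces attached, and how a key-value query selects a subset of rows. It uses ASE's EMT toy calculator purely to generate labels. The file name, the `split` key, and the helper names `build_example_db` and `count_rows` are hypothetical, and the sketch assumes that fairchem forwards `select_args` to the database's `select()` method, so the same keyword arguments would apply there.

```python
def build_example_db(db_path):
    """Write four labelled Cu structures to an ASE database at db_path."""
    # ASE is imported inside the function so this module can be imported
    # even where ASE is not installed.
    from ase.build import bulk
    from ase.calculators.emt import EMT
    from ase.db import connect

    with connect(str(db_path)) as db:
        for i in range(4):
            atoms = bulk("Cu", cubic=True)
            atoms.rattle(stdev=0.05, seed=i)  # perturb so forces are non-zero
            atoms.calc = EMT()
            # Compute the labels so they are cached on the calculator;
            # db.write() stores cached calculator results with the row.
            atoms.get_potential_energy()
            atoms.get_forces()
            # "split" is a hypothetical key-value pair used for filtering.
            db.write(atoms, split="train" if i < 3 else "val")
    return str(db_path)


def count_rows(db_path, **select_args):
    """Count rows matching the given ASE query keyword arguments."""
    from ase.db import connect

    with connect(str(db_path)) as db:
        return db.count(**select_args)
```

For example, after `build_example_db("example.db")`, calling `count_rows("example.db", split="train")` counts only the rows written with `split="train"`. Under the assumption above, the same filter would be expressed in the config as `split: train` under `select_args`.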
To use an ASE database, change the dataset section of your config file to use the ASE DB dataset instead of the LMDB dataset:
dataset:
  format: ase_db
  train:
    src: # The path/address to your ASE DB
    connect_args:
      # Keyword arguments for ase.db.connect()
    select_args:
      # Keyword arguments for select() on the connected database (ase.db.core.Database.select)
      # These can be used to query/filter the ASE DB
    a2g_args:
      r_energy: True
      r_forces: True
      # Set these if you want to train on energy/forces
      # Energy/force information must be in the ASE DB!
    keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets!
    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
  val:
    src:
    a2g_args:
      r_energy: True
      r_forces: True
  test:
    src:
    a2g_args:
      r_energy: False
      r_forces: False
      # It is not necessary to have energy or forces if you are just making predictions.

Using ASE-Readable Files
It is possible to train/predict directly on ASE-readable files.
There are two options for loading data with the ASE reader:
Single-Structure Files
This dataset assumes a single structure will be obtained from each file:
dataset:
  format: ase_read
  train:
    src: # The folder that contains ASE-readable files
    pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif".
    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
    ase_read_args:
      # Keyword arguments for ase.io.read()
    a2g_args:
      # Include energy and forces for training purposes
      # If True, the energy/forces must be readable from the file (e.g. OUTCAR)
      r_energy: True
      r_forces: True
    keep_in_memory: False

Multi-Structure Files
This dataset supports reading files that each contain multiple structures (for example, an ASE .traj file).
dataset:
  format: ase_read_multi
  train:
    index_file: # Path to an index file that lists each filename and the number of structures it contains, e.g.:
      # /path/to/relaxation1.traj 200
      # /path/to/relaxation2.traj 150
      # ...
    # If using an index file, the src and pattern are not necessary
    src: # The folder that contains ASE-readable files
    pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz".
    ase_read_args:
      # Keyword arguments for ase.io.read()
    a2g_args:
      # Include energy and forces for training purposes
      r_energy: True
      r_forces: True
    keep_in_memory: False
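
An index file in the "filename count" format described for index_file can be generated rather than written by hand. The sketch below finds files with a glob pattern, counts the structures in each one, and writes one line per file. The helper names `count_structures` and `write_index_file` are hypothetical, the sketch assumes that pattern behaves like a pathlib-style glob relative to src, and the default counter relies on `ase.io.read` with `index=":"` to load every structure in a file.

```python
from pathlib import Path


def count_structures(path):
    """Count the structures in one ASE-readable file (requires ASE)."""
    # Imported here so the rest of the module works without ASE installed.
    from ase.io import read

    # index=":" asks ASE for every image in the file, returned as a list.
    return len(read(str(path), index=":"))


def write_index_file(src, pattern, index_path, counter=count_structures):
    """Write '<filename> <n_structures>' lines for files under src matching pattern.

    Returns the list of lines written (without trailing newlines).
    """
    files = sorted(p for p in Path(src).glob(pattern) if p.is_file())
    # Fields are separated by a single space, so paths that themselves
    # contain spaces may not parse correctly and are best avoided.
    lines = [f"{path} {counter(path)}" for path in files]
    Path(index_path).write_text("".join(line + "\n" for line in lines))
    return lines
```

For example, `write_index_file("/path/to/relaxations", "**/*.traj", "index.txt")` would search recursively for .traj files; the directory and output file name here are placeholders. Passing a custom `counter` is useful when the structure counts are already known and re-reading each file is unnecessary.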