# Running Model Benchmarks

Model benchmarks evaluate a model on downstream property predictions that require several model evaluations to compute a single property or a set of related properties, for example structure relaxations, elastic tensors, phonons, or adsorption energies.

:::{danger} Security Warning
**Never run YAML configuration files from untrusted sources.** FAIRChem uses [Hydra](https://hydra.cc/) to instantiate Python objects from YAML configs via the `_target_` key. A maliciously crafted config file can execute arbitrary code on your machine. Only use configs that you have written yourself or that come from trusted sources. This is analogous to the security risks of Python's `pickle` and `torch.load()`.
:::

## Available Benchmark Configurations

To benchmark UMA models on standard datasets, you can find benchmark configuration files in `configs/uma/benchmark`. Example files include:
- `adsorbml.yaml`
- `hea-is2re.yaml`
- `kappa103.yaml`
- `matbench-discovery-discovery.yaml`
- `mdr-phonon.yaml`

:::{note}
To run these UMA benchmarks you will need to obtain the target data.
:::

## Running Benchmarks

Run the benchmark script using the fairchem CLI, specifying the benchmark config:

```bash
fairchem --config configs/uma/benchmark/benchmark.yaml
```

Replace `benchmark.yaml` with the desired benchmark config file.

:::{tip}
Benchmark results are saved to a **results** directory under the **run_dir** specified in the configuration file. Benchmark metrics are also logged with the configured logger; currently only Weights & Biases is supported.
:::

## Benchmark Configuration File Format

Evaluation configuration files are written in Hydra YAML format and specify how a model evaluation should be run. UMA evaluation configuration files, which can be used as templates to evaluate other models if needed, are located in `configs/uma/evaluate/`.

### Top-Level Keys

The benchmark configuration files follow the same format as model training and evaluation configuration files, with the addition of a **reducer** key that specifies how final metrics are computed from the results of a given benchmark calculation protocol.

A benchmark configuration file should define the following top level keys:

- **job**: Contains all settings related to the evaluation job itself, including model, data, and logger configuration. For additional details, see the Evaluation page.
- **runner**: Contains settings for a `CalculateRunner`, which implements a downstream property calculation or simulation.
- **reducer**: Contains settings for a `BenchmarkReducer` class, which defines how to aggregate the results produced by the `CalculateRunner` and compute metrics against given target values.
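A minimal sketch of this layout is shown below. It is illustrative only: the `_target_` values are left as placeholders, and the comments indicate where settings go. See the files in `configs/uma/benchmark/` for complete, working examples.

```yaml
# Illustrative layout only; see configs/uma/benchmark/ for complete examples.
job:
  run_dir: ./benchmark_runs    # results are written under <run_dir>/results
  # model, data, and logger settings go here (see the Evaluation page)
runner:
  _target_: ...                # a CalculateRunner implementation
  # runner-specific settings, e.g. the calculations to perform
reducer:
  _target_: ...                # a BenchmarkReducer implementation
  # reducer-specific settings, e.g. the target data used to compute metrics
```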

### CalculateRunners

The benchmark details, including the type of calculations and the model checkpoint, are specified under the **runner** key. The specific benchmark calculations are determined by the chosen `CalculateRunner` (for example, a `RelaxationRunner`). Several `CalculateRunner` implementations are available in the `fairchem.core.components.calculate` submodule.

:::{admonition} Implementing New Calculations
:class: dropdown

It is straightforward to write your own calculations in a `CalculateRunner`. Although the implementation is very flexible and open-ended, we suggest you have a look at the interface defined by the `CalculateRunner` base class. At a minimum, you will need to implement the following methods:

```python
def calculate(self, job_num: int = 0, num_jobs: int = 1) -> R:
    """Implement your calculations here by iterating over the self.input_data attribute"""


def write_results(
    self, results: R, results_dir: str, job_num: int = 0, num_jobs: int = 1
) -> None:
    """Write the results returned by your calculations in the method above"""
```

You will also see `save_state` and `load_state` abstract methods that you can use to checkpoint long-running calculations. In most cases, if your calculations are fast enough, you will not need these and can simply implement them as empty methods.
:::
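To make the contract above concrete, here is a self-contained sketch of a runner-style class. It does not import fairchem or subclass the real `CalculateRunner`; the class name, the toy "calculation" (summing atomic numbers as a stand-in for a model evaluation), and the JSON output format are all illustrative assumptions, not the actual fairchem API.

```python
import json
import os


class ToyEnergyRunner:
    """Illustrative stand-in for a CalculateRunner subclass.

    Iterates over self.input_data, computes a per-structure result, and
    writes the results out, mirroring the calculate / write_results
    contract described above. (save_state / load_state are omitted; if
    checkpointing is unnecessary, they can be implemented as no-ops.)
    """

    def __init__(self, input_data):
        # e.g. a list of structures to evaluate
        self.input_data = input_data

    def calculate(self, job_num: int = 0, num_jobs: int = 1):
        # Each job processes a strided slice so work can be sharded
        # across num_jobs parallel jobs.
        results = []
        for i, item in enumerate(self.input_data):
            if i % num_jobs != job_num:
                continue
            # Placeholder "calculation": a real runner would run a model
            # evaluation (relaxation, phonons, ...) here instead.
            results.append({"id": item["id"], "energy": sum(item["numbers"])})
        return results

    def write_results(
        self, results, results_dir: str, job_num: int = 0, num_jobs: int = 1
    ) -> None:
        # One results file per job, so outputs from sharded jobs don't collide.
        os.makedirs(results_dir, exist_ok=True)
        path = os.path.join(results_dir, f"results_{job_num}-of-{num_jobs}.json")
        with open(path, "w") as f:
            json.dump(results, f)
```

The `job_num` / `num_jobs` striding is what lets a benchmark be split across many jobs, with each job writing its own results file for the reducer to join later.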


### BenchmarkReducers

A `CalculateRunner` runs calculations over a given set of structures and writes out results. To compute benchmark metrics, a `BenchmarkReducer` is used to aggregate all of these results, compute metrics, and report them. Implementations of `BenchmarkReducer` classes are found in the `fairchem.core.components.benchmark` submodule.

:::{admonition} Implementing Custom Metrics
:class: dropdown

If you want to implement your own benchmark metric calculation you can write a `BenchmarkReducer` class. At a minimum, you will need to implement the following methods:

```python
def join_results(self, results_dir: str, glob_pattern: str) -> R:
    """Join your results from multiple files into a single result object."""


def save_results(self, results: R, results_dir: str) -> None:
    """Save joined results to a single file"""


def compute_metrics(self, results: R, run_name: str) -> M:
    """Compute metrics using the joined results and target data in your BenchmarkReducer."""


def save_metrics(self, metrics: M, results_dir: str) -> None:
    """Save the computed metrics to a file."""


def log_metrics(self, metrics: M, run_name: str) -> None:
    """Log metrics to the configured logger."""
```

:::{tip}
If it suits your benchmark metrics and you are happy working with dictionaries and pandas `DataFrame`s, note that a lot of boilerplate code is already implemented in `JsonDFReducer`. We recommend starting there: derive your class from it and focus only on implementing the `compute_metrics` method.
:::
:::
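The sketch below mirrors the reducer contract without importing fairchem or subclassing the real `BenchmarkReducer`: it joins per-job JSON result files and computes a mean absolute error against target values. The class name, the JSON file format, and the MAE metric are illustrative assumptions, not the actual fairchem API.

```python
import glob
import json
import os


class ToyMAEReducer:
    """Illustrative stand-in for a BenchmarkReducer subclass.

    Joins per-job result files written by a runner, then computes a mean
    absolute error against reference target values. A full reducer would
    also implement save_results, save_metrics, and log_metrics to persist
    and report the metrics dictionary.
    """

    def __init__(self, targets):
        # mapping from structure id -> reference value
        self.targets = targets

    def join_results(self, results_dir: str, glob_pattern: str):
        # Gather all per-job result files into a single list.
        joined = []
        for path in sorted(glob.glob(os.path.join(results_dir, glob_pattern))):
            with open(path) as f:
                joined.extend(json.load(f))
        return joined

    def compute_metrics(self, results, run_name: str):
        # Mean absolute error of predicted vs. target values.
        errors = [abs(r["energy"] - self.targets[r["id"]]) for r in results]
        return {"run_name": run_name, "mae": sum(errors) / len(errors)}
```

Keeping `join_results` separate from `compute_metrics` matches the flow described above: results sharded across jobs are first merged into one object, and metrics are computed once over the merged set.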
