Contents

Benchmark methodology

Systems tested

JADE2 (NVIDIA Tesla V100) Benchmarks

ARCHER2 (AMD EPYC 7742) Benchmarks
    Single node
    Multi-node
    MPI and OpenMP scaling

GH200 Testbed Benchmarks

LUMI-G (AMD MI250X) Benchmarks

Comparison across HPC systems
    ARCHER2 and JADE2
    JADE2 and GH200 testbed
    Comparison of MD software across different HPC systems

Memory usage

Multi-instance GPU (MIG)

 

Benchmark methodology

The benchmark input files were taken from the HECBioSim benchmark suite. Benchmarks were run, and plots created, using the hpcbench utility; the hpcbench repository also contains the SLURM input scripts used to run each simulation. Benchmarks were conducted with the version of each MD program available on each HPC system, so benchmarks run on different systems do not always use the same software version. More details on the methodology, along with further results, can be found in the report 'Engineering Supercomputing Platforms for Biomolecular Applications'. Raw benchmark data can be found in the HECBioSim benchmark data repository (note that hpcbench is required to parse this data).

The plots in this section are given in terms of performance (ns/day) and energy usage (kWh/ns). The energy usage is calculated using the energy consumption data reported by SLURM.
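As an illustration of how these metrics are derived, the short sketch below (not part of hpcbench, and only an assumption about how the SLURM data might be read) converts a job's energy, taken from sacct's ConsumedEnergyRaw field (reported in joules), into kWh/ns, and computes ns/day from the simulated time and the wall-clock time.

    import subprocess

    def job_energy_joules(jobid: str) -> float:
        """Read a finished job's total consumed energy (joules) from SLURM's sacct."""
        out = subprocess.run(
            ["sacct", "-j", jobid, "-X", "-n", "-P", "--format=ConsumedEnergyRaw"],
            capture_output=True, text=True, check=True,
        ).stdout
        return float(out.strip().splitlines()[0])

    def kwh_per_ns(jobid: str, ns_simulated: float) -> float:
        """Energy cost of a simulation in kWh per nanosecond of trajectory."""
        kwh = job_energy_joules(jobid) / 3.6e6  # 1 kWh = 3.6e6 J
        return kwh / ns_simulated

    def ns_per_day(ns_simulated: float, wall_clock_seconds: float) -> float:
        """Performance in nanoseconds of trajectory per day of wall-clock time."""
        return ns_simulated * 86400.0 / wall_clock_seconds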

 

Systems tested

System (PDB codes)                   | No. of atoms | Protein atoms | Lipid atoms | Water atoms
3NIR Crambin                         | 21k          | 642           | 0           | 19k
1WDN Glutamine-Binding Protein       | 61k          | 3.5k          | 0           | 58k
hEGFR Dimer of 1IVO and 1NQL         | 465k         | 22k           | 134k        | 309k
Two hEGFR Dimers of 1IVO and 1NQL    | 1.4M         | 43k           | 235k        | 1.1M
Two hEGFR tetramers of 1IVO and 1NQL | 3M           | 87k           | 868k        | 2M

 

JADE2 (NVIDIA Tesla V100) Benchmarks

On JADE2, benchmarks were run on a single GPU: although each JADE2 node contains eight NVIDIA Tesla V100 GPUs, multi-GPU performance was poor. Each benchmark was allocated a corresponding one-eighth of the node's CPU cores (from dual Intel Xeon E5-2698 CPUs), so the performance of CPU-bound software (e.g. LAMMPS) suffers the most from this allocation.

JADE2 molecular dynamics performance compared for five different molecular dynamics programs.
JADE2 molecular dynamics energy usage compared for five different molecular dynamics programs.

 

ARCHER2 (AMD EPYC 7742) Benchmarks

Single node

Each ARCHER2 node contains dual 64-core AMD EPYC 7742 CPUs, for a total of 128 CPU cores. OpenMM was not tested on ARCHER2, as it is not designed to run on CPUs and provides only a reference CPU implementation.

ARCHER2 single node molecular dynamics performance compared for five different molecular dynamics programs.
ARCHER2 single-node energy usage compared for five different molecular dynamics programs.

 

Multi-node

Benchmarks were run on up to 16 ARCHER2 nodes. While the scaling is favourable (particularly for the larger systems), parallel efficiency drops significantly at higher node counts, irrespective of the benchmark system and MD program.
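For reference, the efficiency referred to here is the usual parallel efficiency: the speedup relative to the smallest node count, divided by the corresponding increase in node count. A minimal sketch (the numbers are placeholders, not measured ARCHER2 results):

    def parallel_efficiency(ns_per_day_by_nodes):
        """Map node count -> parallel efficiency, using the smallest node count as the baseline."""
        base_nodes = min(ns_per_day_by_nodes)
        base_perf = ns_per_day_by_nodes[base_nodes]
        return {
            nodes: (perf / base_perf) / (nodes / base_nodes)
            for nodes, perf in sorted(ns_per_day_by_nodes.items())
        }

    # Placeholder numbers only: doubling the node count rarely doubles ns/day.
    print(parallel_efficiency({1: 10.0, 2: 18.0, 4: 30.0, 8: 44.0, 16: 55.0}))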

ARCHER2 MD performance for 21k atom system
ARCHER2 MD energy usage for 21k atom system
ARCHER2 MD performance for 61k atom system
ARCHER2 MD energy usage for 61k atom system
ARCHER2 MD performance for 465k atom system
ARCHER2 MD energy usage for 465k atom system
ARCHER2 MD performance for 1400k atom system
ARCHER2 MD energy usage for 1400k atom system
ARCHER2 MD performance for 3000k atom system
ARCHER2 MD energy usage for 3000k atom system

 

MPI and OpenMP scaling

Different combinations of MPI and OpenMP were tested for GROMACS, NAMD and LAMMPS, all of which support mixed MPI/OpenMP parallelisation. In each figure, the legend shows how many OpenMP threads were used per MPI process, with the rest of the parallelisation handled by MPI processes. In almost all cases, pure MPI parallelisation was fastest.
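As a rough illustration of the combinations involved (the actual SLURM job scripts are in the hpcbench repository; the thread counts below are illustrative, not the exact set tested), the sketch enumerates MPI-rank/OpenMP-thread splits that exactly fill one 128-core ARCHER2 node.

    CORES_PER_NODE = 128  # dual 64-core AMD EPYC 7742

    def rank_thread_splits(cores=CORES_PER_NODE, thread_counts=(1, 2, 4, 8, 16)):
        """Yield (mpi_ranks, omp_threads) pairs that exactly fill one node."""
        for omp_threads in thread_counts:
            if cores % omp_threads == 0:
                yield cores // omp_threads, omp_threads

    for ranks, threads in rank_thread_splits():
        # In a job script this would translate to something like
        # srun --ntasks=<ranks> --cpus-per-task=<threads> with OMP_NUM_THREADS=<threads>.
        print(f"{ranks} MPI ranks x {threads} OpenMP threads")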

GROMACS
GROMACS MD performance for 21k atoms and different numbers of OpenMP threads
GROMACS MD performance for 61k atoms and different numbers of OpenMP threads
GROMACS MD performance for 465k atoms and different numbers of OpenMP threads
GROMACS MD performance for 1400k atoms and different numbers of OpenMP threads
GROMACS MD performance for 3000k atoms and different numbers of OpenMP threads
NAMD
NAMD MD performance for 21k atoms and different numbers of OpenMP threads
NAMD MD performance for 61k atoms and different numbers of OpenMP threads
NAMD MD performance for 465k atoms and different numbers of OpenMP threads
NAMD MD performance for 1400k atoms and different numbers of OpenMP threads
NAMD MD performance for 3000k atoms and different numbers of OpenMP threads
LAMMPS
LAMMPS MD performance for 21k atoms and different numbers of OpenMP threads
LAMMPS MD performance for 61k atoms and different numbers of OpenMP threads
LAMMPS MD performance for 465k atoms and different numbers of OpenMP threads
LAMMPS MD performance for 1400k atoms and different numbers of OpenMP threads
LAMMPS MD performance for 3000k atoms and different numbers of OpenMP threads

GH200 Testbed Benchmarks

The benchmark suite was run on the NVIDIA Grace Hopper GH200 testbed on Bede. Each GH200 superchip combines a 'Hopper' GPU with a fast 72-core Arm Neoverse V2 CPU. Though the GH200 is marketed primarily for AI workloads, it provides excellent molecular dynamics performance and was fully compatible with the existing MD software tested.

MD performance on GH200 testbed
MD energy use on GH200 testbed

 

LUMI-G (AMD MI250X) Benchmarks

OpenMM could not be tested on LUMI-G due to a temporary incompatibility with the OpenMM ROCm plugin at the time of testing. The other MD programs ran on LUMI-G, though LAMMPS was only supported via Kokkos, which does not implement some of the fixes used in the benchmark inputs. In general, performance on the MI250X is comparable to contemporary NVIDIA GPUs, though power efficiency is worse.

MD performance on LUMI-G
MD energy use on LUMI-G

Comparison across HPC systems

 

ARCHER2 and JADE2

ARCHER2 MD performance vs JADE2 performance
ARCHER2 MD energy usage vs JADE2 MD energy usage

JADE2 and GH200 testbed

JADE2 MD performance vs GH200 MD performance
JADE2 MD energy usage vs GH200 MD energy usage

Comparison of MD software across different HPC systems

AMBER MD performance across different HPC systems
AMBER MD energy usage across different HPC systems
GROMACS MD performance across different HPC systems
GROMACS MD energy usage across different HPC systems
LAMMPS MD performance across different HPC systems
LAMMPS MD energy usage across different HPC systems
NAMD MD performance across different HPC systems
NAMD MD energy usage across different HPC systems

Memory usage

MD memory usage on ARCHER2
MD GPU memory usage on JADE2

Multi-instance GPU (MIG)

MIG is an NVIDIA GPU feature that allows a single GPU to be partitioned into multiple instances, each of which can be addressed as if it were a separate GPU. Particularly for smaller, less memory-intensive MD workloads, MIG can provide equivalent or slightly better cumulative ns/day than running a single exclusive MD job. GROMACS cannot scale to many replicas, though this is likely due to GROMACS being CPU-limited.
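To make the two metrics plotted below concrete: per-replica ns/day is the throughput of each MIG instance's own simulation, and cumulative ns/day is the sum over all instances, which is what gets compared against a single job given the whole GPU. A trivial sketch with placeholder numbers (the 7-way partition is hypothetical):

    def cumulative_ns_per_day(per_replica):
        """Total trajectory produced per day, summed over all MIG instances."""
        return sum(per_replica)

    # Placeholder numbers: a hypothetical 7-way MIG partition vs. one exclusive job.
    replicas = [14.0] * 7
    whole_gpu = 95.0
    print(cumulative_ns_per_day(replicas), "ns/day across replicas vs", whole_gpu, "ns/day exclusive")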

MIG performance per replica
MIG performance (cumulative, total ns/day of the system)