Here we present the results of our benchmarking of ARCHER2. The benchmarks used are those of our HECBioSim benchmark suite. They are not heavily tuned, since they are designed to let our community work out how much HPC time to request on this resource, and are therefore set up at a level that would be reasonable for a typical biomolecular MD simulation. More information on the benchmarks themselves can be found on the HECBioSim benchmark suite pages.
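As an example of how the figures below can be turned into a resource request, here is a minimal sketch (our own illustration, not part of the benchmark suite; the function name and the example numbers are ours) that converts a target trajectory length and a ns/day figure read from the tables into node-hours.

```python
def node_hours(target_ns: float, ns_per_day: float, nodes: int) -> float:
    """Node-hours needed to run `target_ns` at `ns_per_day` on `nodes` nodes."""
    wallclock_days = target_ns / ns_per_day      # wall-clock time in days
    return wallclock_days * 24.0 * nodes         # convert to hours, scale by node count

# Example: 100 ns of the 1.4M-atom system with GROMACS on 8 nodes (~41.8 ns/day below)
print(round(node_hours(100.0, 41.8, 8)))         # ~459 node-hours
```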
ARCHER2 is hosted by EPCC and is the new Tier-1 national machine in the UK. It is built on the HPE Cray EX (formerly Shasta) architecture and will consist of 5,848 nodes, each compute node having two 64-core AMD EPYC 7742 (Zen2 "Rome") processors with 8 NUMA regions per node, giving a total of 748,544 cores. The majority of these nodes have 256 GB of RAM, whilst 292 have high memory (512 GB). The compute nodes are connected via two high-speed HPE Cray Slingshot bidirectional interconnects for fast parallel communication and have access to high-performance parallel filesystems.
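As a quick arithmetic check of these figures (a minimal sketch using only the numbers quoted in this section):

```python
# Sanity check of the quoted hardware figures: 5,848 nodes x 2 sockets x 64 cores
# gives the quoted 748,544 cores, and 8 NUMA regions per node implies 16 cores
# per NUMA region.
nodes, sockets_per_node, cores_per_socket, numa_per_node = 5848, 2, 64, 8
print(nodes * sockets_per_node * cores_per_socket)           # 748544
print(sockets_per_node * cores_per_socket // numa_per_node)  # 16
```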
AMBER 20
| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 128 | 104.36 | 27.31 | 2.92 | 1.24 | 0.6 |
| 2 | 256 | 89.13 | 33.32 | 4.31 | 1.77 | 0.83 |
| 3 | 384 | 80.21 | 29.09 | 5.25 | 1.85 | 0.96 |
| 4 | 512 | 66.42 | 24.51 | 5.12 | 1.91 | 1.01 |
| 5 | 640 | 50.71 | 25.18 | 5.33 | 2.25 | 1.09 |
| 6 | 768 | 32.62 | 24.70 | 4.55 | 2.26 | 1.25 |
| 7 | 896 | 32.69 | 23.72 | 4.08 | failed | 1.15 |
| 8 | 1024 | 27.72 | 22.03 | 3.88 | failed | 1.26 |
| 9 | 1152 | 24.95 | 21.47 | 3.87 | 2.05 | 1.26 |
| 10 | 1280 | 21.07 | 19.58 | 4.25 | 2.25 | 1.14 |
| 12 | 1536 | | | 4.14 | failed | 1.16 |
| 14 | 1792 | | | 3.99 | 2.05 | 0.75 |
| 16 | 2048 | | | 3.59 | 2.14 | 0.92 |
| 18 | 2304 | | | 3.5 | 2.24 | 1.07 |
| 20 | 2560 | | | 3.38 | failed | 1.08 |
| 22 | 2816 | | | | failed | 1.03 |
| 24 | 3072 | | | | failed | 1.02 |
| 28 | 3584 | | | | 1.79 | 1.02 |
Gromacs 2020.3 Single Precision
With GROMACS there are many parallelisation options available to the user. On ARCHER2 the main ones that make sense are the number of MPI ranks and the number of OpenMP threads assigned per node. We benchmarked a number of different configurations for each of the node counts shown; however, the full dataset is very large, so only the fastest case for each node count is presented in the table below. The chart to the right shows a snapshot of these different configurations for one of the single-node cases in the table. As can be seen, by far the best configuration is the one in which every core is occupied by a single MPI rank (pure MPI, with one OpenMP thread per rank).
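To make that configuration space concrete, the sketch below enumerates the per-node MPI-rank/OpenMP-thread combinations that keep all 128 cores of an ARCHER2 node busy. This is an illustration rather than the script we used, and the set of thread counts swept is an assumption; the fastest GROMACS case reported in the table corresponds to 128 ranks x 1 thread per node.

```python
# Illustrative sweep of per-node decompositions, assuming 128 cores per node
# and that every core is kept busy (ranks_per_node * threads_per_rank == 128).
CORES_PER_NODE = 128

def per_node_configs(cores: int = CORES_PER_NODE):
    """Yield (mpi_ranks, omp_threads) pairs that exactly fill a node."""
    for omp_threads in (1, 2, 4, 8, 16):   # thread counts here are illustrative
        if cores % omp_threads == 0:
            yield cores // omp_threads, omp_threads

for ranks, threads in per_node_configs():
    print(f"{ranks:3d} MPI ranks x {threads:2d} OpenMP threads per node")
```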
| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 128 | 250.9 | 126.1 | 11.1 | 5.9 | 2.5 |
| 2 | 256 | 567.6 | 188.3 | 28.0 | 11.4 | 4.9 |
| 3 | 384 | 646.3 | 293.7 | 38.3 | 18.9 | 8.3 |
| 4 | 512 | 688.6 | 316.6 | 42.1 | 22.8 | 9.6 |
| 5 | 640 | 686.8 | 399.3 | 48.1 | 29.7 | 13.1 |
| 6 | 768 | failed | 404.8 | 56.1 | 34.6 | 15.1 |
| 7 | 896 | failed | 410.7 | 73.8 | 33.6 | 18.9 |
| 8 | 1024 | failed | 429.9 | 81.2 | 41.8 | 19.6 |
| 9 | 1152 | | 466.1 | 92.6 | 47.1 | 22.9 |
| 10 | 1280 | | failed | 100.6 | 49.8 | 24.9 |
| 12 | 1536 | | failed | 107.8 | 60.2 | 30.0 |
| 14 | 1792 | | failed | 119.1 | 61.4 | 30.2 |
| 16 | 2048 | | | 126.3 | 69.8 | 31.8 |
| 18 | 2304 | | | 137.7 | 64.3 | 38.8 |
| 20 | 2560 | | | 151.8 | 83.6 | 41.5 |
| 22 | 2816 | | | 152.4 | 82.0 | 42.3 |
| 24 | 3072 | | | 159.9 | 97.3 | 48.7 |
| 28 | 3584 | | | 173.3 | 105.6 | 55.3 |
| 32 | 4096 | | | 183.7 | 101.3 | 59.2 |
| 40 | 5120 | | | failed | 123.6 | 70.8 |
| 48 | 6144 | | | failed | 134.2 | 83.1 |
| 56 | 7168 | | | failed | 126.9 | 93.3 |
| 64 | 8192 | | | | 126.7 | 93.8 |
| 72 | 9216 | | | | 127.4 | 97.1 |
| 80 | 10240 | | | | failed | 104.1 |
| 88 | 11264 | | | | failed | 91.3 |
| 96 | 12288 | | | | failed | 110.5 |
| 104 | 13312 | | | | | 91.3 |
| 112 | 14336 | | | | | 111.6 |
| 120 | 15360 | | | | | 111.2 |
| 136 | 17408 | | | | | 104.1 |
| 152 | 19456 | | | | | 94.8 |
| 168 | 21504 | | | | | failed |
| 184 | 23552 | | | | | failed |
| 200 | 25600 | | | | | failed |
LAMMPS 03.3.2020
| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 128 | 37.0 | 14.3 | 2.1 | 0.7 | 0.10 |
| 2 | 256 | 58.7 | 25.0 | 3.8 | 1.3 | 0.20 |
| 3 | 384 | 74.4 | 31.4 | 5.6 | 2.0 | 0.30 |
| 4 | 512 | 84.9 | 40.1 | 7.2 | 2.5 | 0.38 |
| 5 | 640 | 81.0 | 45.6 | 8.7 | 3.1 | 0.47 |
| 6 | 768 | 90.6 | 49.2 | 10.2 | 3.7 | 0.57 |
| 7 | 896 | 94.1 | 53.0 | 11.3 | 4.2 | 0.66 |
| 8 | 1024 | 100.7 | 58.7 | 12.9 | 4.8 | 0.74 |
| 9 | 1152 | 101.3 | 63.7 | 13.9 | 5.2 | 0.84 |
| 10 | 1280 | 100.6 | 64.9 | 14.8 | 5.9 | 0.93 |
| 12 | 1536 | 103.5 | 69.2 | 17.3 | 6.6 | 1.05 |
| 14 | 1792 | 101.6 | 71.9 | 19.2 | 7.5 | 1.22 |
| 16 | 2048 | 103.6 | 77.6 | 21.0 | 8.4 | 1.33 |
| 18 | 2304 | 103.2 | 71.2 | 22.7 | 9.2 | 1.53 |
| 20 | 2560 | 101.8 | 73.8 | 23.7 | 10.2 | 1.71 |
| 22 | 2816 | 97.04 | 77.9 | 24.5 | 10.8 | 1.87 |
| 24 | 3072 | 103.7 | 75.8 | 27.1 | 11.4 | 2.01 |
| 28 | 3584 | | 75.2 | 29.4 | 13.0 | 2.30 |
| 32 | 4096 | | 75.7 | 30.6 | 14.2 | 2.60 |
| 40 | 5120 | | | | 16.3 | 3.14 |
| 48 | 6144 | | | | 18.1 | 3.70 |
| 56 | 7168 | | | | 19.5 | 4.20 |
| 64 | 8192 | | | | 21.2 | 4.73 |
| 72 | 9216 | | | | 20.3 | 5.14 |
| 80 | 10240 | | | | 21.1 | 5.52 |
| 88 | 11264 | | | | 21.1 | 5.98 |
| 96 | 12288 | | | | 21.1 | 6.16 |
| 104 | 13312 | | | | 21.6 | 6.49 |
| 112 | 14336 | | | | 20.1 | 6.93 |
| 120 | 15360 | | | | 23.3 | 7.20 |
| 136 | 17408 | | | | | 7.77 |
| 152 | 19456 | | | | | 8.29 |
| 168 | 21504 | | | | | 8.69 |
| 184 | 23552 | | | | | 8.91 |
| 200 | 25600 | | | | | 9.33 |
NAMD 2.14
With NAMD there are many parallelisation options available to the user. On ARCHER2 the main ones that make sense are the number of MPI ranks and the number of OpenMP threads assigned per node. We benchmarked a number of different configurations for each of the node counts shown; however, the full dataset is very large, so only the fastest case for each node count is presented in the table below. The chart to the right shows a snapshot of these different configurations for one of the single-node cases in the table. As can be seen, the best configurations are those in which each node runs either 16 MPI ranks with 8 OpenMP threads each or, marginally faster, 32 MPI ranks with 4 OpenMP threads each.
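As an illustration (a minimal sketch, not our launch scripts), the two per-node layouts singled out above both fill the 128 cores of an ARCHER2 node exactly; the snippet below prints the totals each layout implies for a given node count. The node count in the example call is arbitrary.

```python
# Hypothetical helper showing the two per-node NAMD layouts mentioned above,
# assuming 128 cores per ARCHER2 node.
CORES_PER_NODE = 128
LAYOUTS = [(16, 8), (32, 4)]  # (MPI ranks per node, OpenMP threads per rank)

def describe(nodes: int):
    for ranks_per_node, threads_per_rank in LAYOUTS:
        # Each layout should use every core on every node.
        assert ranks_per_node * threads_per_rank == CORES_PER_NODE
        total_ranks = ranks_per_node * nodes
        print(f"{nodes} nodes: {total_ranks} MPI ranks x {threads_per_rank} "
              f"OpenMP threads = {total_ranks * threads_per_rank} cores")

describe(8)  # e.g. 8 nodes: 128 ranks x 8 threads, or 256 ranks x 4 threads
```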
| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 128 | 84.7 | 32.2 | 4.3 | 1.4 | 0.50 |
| 2 | 256 | 123.0 | 59.6 | 8.4 | 2.8 | 0.91 |
| 3 | 384 | 159.0 | 83.0 | 12.7 | 2.9 | 1.38 |
| 4 | 512 | 172.0 | 76.4 | 16.4 | 5.5 | 2.60 |
| 5 | 640 | 209.3 | 107.9 | 20.3 | 4.7 | 2.20 |
| 6 | 768 | 225.1 | 110.7 | 23.4 | 5.4 | 2.63 |
| 7 | 896 | 232.1 | 103.6 | 18.8 | 7.9 | 3.32 |
| 8 | 1024 | 196.8 | 117.0 | 25.7 | 7.7 | 5.05 |
| 9 | 1152 | 226.5 | 126.1 | 33.6 | 8.8 | 5.71 |
| 10 | 1280 | 233.7 | 123.9 | 36.5 | 9.1 | 4.33 |
| 12 | 1536 | 228.3 | 142.4 | 41.6 | 11.3 | 7.64 |
| 14 | 1792 | 233.9 | 149.2 | 45.2 | 12.9 | 8.77 |
| 16 | 2048 | 245.3 | 140.0 | 40.6 | 15.1 | 9.82 |
| 18 | 2304 | 262.9 | 167.5 | 48.3 | 23.0 | 10.43 |
| 20 | 2560 | 244.9 | 167.2 | 50.4 | 18.3 | 8.79 |
| 22 | 2816 | 231.5 | 158.9 | 55.0 | 21.3 | 9.04 |
| 24 | 3072 | 241.7 | 181.9 | 64.1 | 28.8 | 10.14 |
| 28 | 3584 | | 167.4 | 60.9 | 24.8 | 16.34 |
| 32 | 4096 | | 109.8 | 81.7 | 36.1 | 17.90 |
| 40 | 5120 | | | 72.8 | 32.8 | 22.22 |
| 48 | 6144 | | | 96.1 | 35.7 | 26.11 |
| 56 | 7168 | | | 101.0 | 41.6 | 23.38 |
| 64 | 8192 | | | 106.7 | 49.4 | 26.66 |
| 72 | 9216 | | | 108.8 | 50.3 | 32.36 |
| 80 | 10240 | | | 112.9 | 41.6 | 30.73 |
| 88 | 11264 | | | 119.5 | 42.3 | 33.86 |
| 96 | 12288 | | | 93.1 | 51.4 | 33.45 |
| 104 | 13312 | | | 99.0 | 51.3 | 33.84 |
| 112 | 14336 | | | 122.8 | 58.2 | 33.84 |
| 120 | 15360 | | | 115.3 | 52.1 | 32.77 |
| 136 | 17408 | | | 111.0 | 68.7 | 37.97 |
| 152 | 19456 | | | 139.8 | 67.9 | 42.25 |
| 168 | 21504 | | | 117.2 | 67.2 | 47.59 |
| 184 | 23552 | | | 96.8 | 76.0 | 38.84 |
| 200 | 25600 | | | | 66.6 | 40.73 |
| 240 | 30720 | | | | 68.9 | 45.49 |
| 280 | 35840 | | | | 81.3 | 44.67 |
| 320 | 40960 | | | | 65.6 | 52.17 |
| 360 | 46080 | | | | 73.5 | 48.89 |
| 400 | 51200 | | | | 62.4 | 44.51 |
| 440 | 56320 | | | | | 45.56 |
| 480 | 61440 | | | | | 43.87 |
| 520 | 66560 | | | | | 45.37 |
| 560 | 71680 | | | | | 43.54 |
| 600 | 76800 | | | | | 45.64 |
| 640 | 81920 | | | | | 39.98 |
| 680 | 87040 | | | | | 35.93 |