Here we present the results of our benchmarking of ARCHER2. The benchmarks we are running come from our HECBioSim benchmark suite (linked here). Our benchmarks are not heavily tuned, as they are designed to be used by our community to work out how much HPC time to ask for on this resource, so they are set up at a level that would be reasonable for any normal biomolecular MD simulation. More information on the benchmarks themselves can be found at the above link.

ARCHER2 is hosted by EPCC and is the new Tier-1 national machine in the UK. ARCHER2 is based on the HPE Cray EX (formerly Shasta) architecture and will consist of 5,848 nodes, with each compute node having two 64-core AMD Zen2 (Rome) EPYC 7742 processors (8 NUMA regions per node), giving a total of 748,544 cores. The majority of these nodes have 256 GB of RAM, whilst 292 are high-memory nodes with 512 GB. The compute nodes are connected by two high-speed HPE Cray Slingshot interconnect links for fast parallel communication and have access to high-performance parallel filesystems.
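
Since the whole point of these benchmarks is to help users work out how much HPC time to ask for, the ns/day figures in the tables below can be turned into a rough allocation estimate. The sketch below is a minimal illustration, assuming (as on ARCHER2 at the time of writing) that allocations are counted in CUs with 1 CU corresponding to one node hour; check the current ARCHER2 documentation for the exact charging rules, and note that the function name here is only illustrative.

```python
def estimate_allocation(ns_per_day: float, nodes: int, target_ns: float) -> dict:
    """Rough cost of a production run, given a benchmark throughput.

    ns_per_day : throughput for your system size and node count (from the tables below)
    nodes      : the node count that throughput was measured on
    target_ns  : how many nanoseconds of trajectory you want to produce

    Assumes 1 CU = 1 node hour (the ARCHER2 convention at the time of writing).
    """
    wallclock_days = target_ns / ns_per_day
    node_hours = wallclock_days * 24 * nodes      # total node hours (CUs) consumed
    ns_per_cu = ns_per_day / (24 * nodes)         # ns of trajectory produced per CU
    return {"wallclock_days": wallclock_days, "CUs": node_hours, "ns_per_CU": ns_per_cu}

# Example: the 1.4M atom GROMACS benchmark reaches ~22.8 ns/day on 4 nodes (see below),
# so a 100 ns production run would need roughly 4.4 days of wallclock and ~420 CUs.
print(estimate_allocation(ns_per_day=22.8, nodes=4, target_ns=100))
```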

 

 

AMBER 20

| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
|---|---|---|---|---|---|---|
| 1 | 128 | 104.36 | 27.31 | 2.92 | 1.24 | 0.6 |
| 2 | 256 | 89.13 | 33.32 | 4.31 | 1.77 | 0.83 |
| 3 | 384 | 80.21 | 29.09 | 5.25 | 1.85 | 0.96 |
| 4 | 512 | 66.42 | 24.51 | 5.12 | 1.91 | 1.01 |
| 5 | 640 | 50.71 | 25.18 | 5.33 | 2.25 | 1.09 |
| 6 | 768 | 32.62 | 24.70 | 4.55 | 2.26 | 1.25 |
| 7 | 896 | 32.69 | 23.72 | 4.08 | failed | 1.15 |
| 8 | 1024 | 27.72 | 22.03 | 3.88 | failed | 1.26 |
| 9 | 1152 | 24.95 | 21.47 | 3.87 | 2.05 | 1.26 |
| 10 | 1280 | 21.07 | 19.58 | 4.25 | 2.25 | 1.14 |
| 12 | 1536 | | | 4.14 | failed | 1.16 |
| 14 | 1792 | | | 3.99 | 2.05 | 0.75 |
| 16 | 2048 | | | 3.59 | 2.14 | 0.92 |
| 18 | 2304 | | | 3.5 | 2.24 | 1.07 |
| 20 | 2560 | | | 3.38 | failed | 1.08 |
| 22 | 2816 | | | | failed | 1.03 |
| 24 | 3072 | | | | failed | 1.02 |
| 28 | 3584 | | | | 1.79 | 1.02 |
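
One way to read these tables is in terms of parallel efficiency, i.e. how much of the ideal speed-up survives as nodes are added. The short sketch below applies that calculation to the 465k atom AMBER column above; the numbers are copied straight from the table and nothing here is a new measurement.

```python
# ns/day for the 465k atom AMBER benchmark, keyed by node count (values from the table above)
perf_465k = {1: 2.92, 2: 4.31, 3: 5.25, 4: 5.12, 5: 5.33}

def parallel_efficiency(perf):
    """Speed-up relative to the 1-node run, divided by the node count (1.0 = perfect scaling)."""
    base = perf[1]
    return {nodes: (ns_day / base) / nodes for nodes, ns_day in perf.items()}

for nodes, eff in parallel_efficiency(perf_465k).items():
    print(f"{nodes} node(s): {eff:.0%} efficient")
# 1 node(s): 100%, 2 node(s): 74%, 3 node(s): 60%, 4 node(s): 44%, 5 node(s): 37%
```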

 

GROMACS 2020.3 Single Precision

[Chart: MPI vs OpenMP combinations]

With GROMACS there are many parallelisation options available to the user. On ARCHER2 the main choices that make sense are the number of MPI ranks per node and the number of OpenMP threads given to each rank. We benchmarked a number of different combinations on ARCHER2 for each of the node counts in the table below; however, the full dataset is very large, so only the fastest configuration for each node count is presented. The chart shows a snapshot of these different configurations for one of the single-node cases. As can be seen, by far the best configuration is the one in which every core runs its own MPI rank. A sketch of how such a sweep of configurations can be generated is given below.
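
As a rough illustration of the kind of sweep behind that chart, the snippet below enumerates the MPI-rank/OpenMP-thread splits that exactly fill a 128-core ARCHER2 node and prints a candidate launch line for each. The `srun` options, `OMP_NUM_THREADS` and the `gmx_mpi mdrun -ntomp` flag are standard, but module names, binary names and scheduler defaults on ARCHER2 may differ, and `benchmark.tpr` simply stands in for whichever benchmark input is being run, so treat this as a sketch rather than a ready-made job script.

```python
CORES_PER_NODE = 128  # two 64-core AMD Rome processors per ARCHER2 node

def rank_thread_splits(cores=CORES_PER_NODE):
    """All (MPI ranks, OpenMP threads per rank) pairs that exactly fill one node."""
    return [(ranks, cores // ranks) for ranks in range(1, cores + 1) if cores % ranks == 0]

for ranks, threads in rank_thread_splits():
    # Hypothetical single-node GROMACS launch line; adapt to your own job script.
    print(f"OMP_NUM_THREADS={threads} srun --ntasks-per-node={ranks} "
          f"--cpus-per-task={threads} gmx_mpi mdrun -ntomp {threads} -s benchmark.tpr")
```

In our runs the pure-MPI end of this range (128 ranks with a single OpenMP thread each) gave the best single-node performance.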

| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
|---|---|---|---|---|---|---|
| 1 | 128 | 250.9 | 126.1 | 11.1 | 5.9 | 2.5 |
| 2 | 256 | 567.6 | 188.3 | 28.0 | 11.4 | 4.9 |
| 3 | 384 | 646.3 | 293.7 | 38.3 | 18.9 | 8.3 |
| 4 | 512 | 688.6 | 316.6 | 42.1 | 22.8 | 9.6 |
| 5 | 640 | 686.8 | 399.3 | 48.1 | 29.7 | 13.1 |
| 6 | 768 | failed | 404.8 | 56.1 | 34.6 | 15.1 |
| 7 | 896 | failed | 410.7 | 73.8 | 33.6 | 18.9 |
| 8 | 1024 | failed | 429.9 | 81.2 | 41.8 | 19.6 |
| 9 | 1152 | | 466.1 | 92.6 | 47.1 | 22.9 |
| 10 | 1280 | | failed | 100.6 | 49.8 | 24.9 |
| 12 | 1536 | | failed | 107.8 | 60.2 | 30.0 |
| 14 | 1792 | | failed | 119.1 | 61.4 | 30.2 |
| 16 | 2048 | | | 126.3 | 69.8 | 31.8 |
| 18 | 2304 | | | 137.7 | 64.3 | 38.8 |
| 20 | 2560 | | | 151.8 | 83.6 | 41.5 |
| 22 | 2816 | | | 152.4 | 82.0 | 42.3 |
| 24 | 3072 | | | 159.9 | 97.3 | 48.7 |
| 28 | 3584 | | | 173.3 | 105.6 | 55.3 |
| 32 | 4096 | | | 183.7 | 101.3 | 59.2 |
| 40 | 5120 | | | failed | 123.6 | 70.8 |
| 48 | 6144 | | | failed | 134.2 | 83.1 |
| 56 | 7168 | | | failed | 126.9 | 93.3 |
| 64 | 8192 | | | | 126.7 | 93.8 |
| 72 | 9216 | | | | 127.4 | 97.1 |
| 80 | 10240 | | | | failed | 104.1 |
| 88 | 11264 | | | | failed | 91.3 |
| 96 | 12288 | | | | failed | 110.5 |
| 104 | 13312 | | | | | 91.3 |
| 112 | 14336 | | | | | 111.6 |
| 120 | 15360 | | | | | 111.2 |
| 136 | 17408 | | | | | 104.1 |
| 152 | 19456 | | | | | 94.8 |
| 168 | 21504 | | | | | failed |
| 184 | 23552 | | | | | failed |
| 200 | 25600 | | | | | failed |

 

LAMMPS 3Mar2020

| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
|---|---|---|---|---|---|---|
| 1 | 128 | 37.0 | 14.3 | 2.1 | 0.7 | 0.10 |
| 2 | 256 | 58.7 | 25.0 | 3.8 | 1.3 | 0.20 |
| 3 | 384 | 74.4 | 31.4 | 5.6 | 2.0 | 0.30 |
| 4 | 512 | 84.9 | 40.1 | 7.2 | 2.5 | 0.38 |
| 5 | 640 | 81.0 | 45.6 | 8.7 | 3.1 | 0.47 |
| 6 | 768 | 90.6 | 49.2 | 10.2 | 3.7 | 0.57 |
| 7 | 896 | 94.1 | 53.0 | 11.3 | 4.2 | 0.66 |
| 8 | 1024 | 100.7 | 58.7 | 12.9 | 4.8 | 0.74 |
| 9 | 1152 | 101.3 | 63.7 | 13.9 | 5.2 | 0.84 |
| 10 | 1280 | 100.6 | 64.9 | 14.8 | 5.9 | 0.93 |
| 12 | 1536 | 103.5 | 69.2 | 17.3 | 6.6 | 1.05 |
| 14 | 1792 | 101.6 | 71.9 | 19.2 | 7.5 | 1.22 |
| 16 | 2048 | 103.6 | 77.6 | 21.0 | 8.4 | 1.33 |
| 18 | 2304 | 103.2 | 71.2 | 22.7 | 9.2 | 1.53 |
| 20 | 2560 | 101.8 | 73.8 | 23.7 | 10.2 | 1.71 |
| 22 | 2816 | 97.04 | 77.9 | 24.5 | 10.8 | 1.87 |
| 24 | 3072 | 103.7 | 75.8 | 27.1 | 11.4 | 2.01 |
| 28 | 3584 | | 75.2 | 29.4 | 13.0 | 2.30 |
| 32 | 4096 | | 75.7 | 30.6 | 14.2 | 2.60 |
| 40 | 5120 | | | | 16.3 | 3.14 |
| 48 | 6144 | | | | 18.1 | 3.70 |
| 56 | 7168 | | | | 19.5 | 4.20 |
| 64 | 8192 | | | | 21.2 | 4.73 |
| 72 | 9216 | | | | 20.3 | 5.14 |
| 80 | 10240 | | | | 21.1 | 5.52 |
| 88 | 11264 | | | | 21.1 | 5.98 |
| 96 | 12288 | | | | 21.1 | 6.16 |
| 104 | 13312 | | | | 21.6 | 6.49 |
| 112 | 14336 | | | | 20.1 | 6.93 |
| 120 | 15360 | | | | 23.3 | 7.20 |
| 136 | 17408 | | | | | 7.77 |
| 152 | 19456 | | | | | 8.29 |
| 168 | 21504 | | | | | 8.69 |
| 184 | 23552 | | | | | 8.91 |
| 200 | 25600 | | | | | 9.33 |

 

NAMD 2.14

[Chart: MPI vs OpenMP combinations]

With NAMD there are likewise many parallelisation options available to the user. On ARCHER2 the main choices that make sense are the number of MPI ranks per node and the number of OpenMP threads given to each rank. We benchmarked a number of different combinations on ARCHER2 for each of the node counts in the table below; however, the full dataset is very large, so only the fastest configuration for each node count is presented. The chart shows a snapshot of these different configurations for one of the single-node cases. As can be seen, the best configurations are those in which each node runs either 16 MPI ranks with 8 OpenMP threads each or, marginally faster, 32 MPI ranks with 4 OpenMP threads each.
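
A plausible (but unverified on our part) explanation for why the 16x8 and 32x4 splits do well is that they keep each rank's threads inside a single 16-core NUMA region of the node described above. The sketch below assumes exactly that criterion and lists the splits of a 128-core node with this property; it is an illustration of the idea rather than something we measured directly.

```python
CORES_PER_NODE = 128  # per ARCHER2 node
NUMA_REGIONS = 8      # per node, i.e. 16 cores per NUMA region (see hardware description above)

def numa_friendly_splits(cores=CORES_PER_NODE, numa_regions=NUMA_REGIONS):
    """(MPI ranks, OpenMP threads per rank) pairs that fill the node while keeping each
    rank's threads within one NUMA region (threads per rank <= cores per region)."""
    region_cores = cores // numa_regions
    return [(ranks, cores // ranks) for ranks in range(1, cores + 1)
            if cores % ranks == 0 and cores // ranks <= region_cores]

print(numa_friendly_splits())
# [(8, 16), (16, 8), (32, 4), (64, 2), (128, 1)] -- the 16x8 and 32x4 cases above are in this set
```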

| nodes | cores | 20k atoms (ns/day) | 61k atoms (ns/day) | 465k atoms (ns/day) | 1.4M atoms (ns/day) | 3M atoms (ns/day) |
|---|---|---|---|---|---|---|
| 1 | 128 | 84.7 | 32.2 | 4.3 | 1.4 | 0.50 |
| 2 | 256 | 123.0 | 59.6 | 8.4 | 2.8 | 0.91 |
| 3 | 384 | 159.0 | 83.0 | 12.7 | 2.9 | 1.38 |
| 4 | 512 | 172.0 | 76.4 | 16.4 | 5.5 | 2.60 |
| 5 | 640 | 209.3 | 107.9 | 20.3 | 4.7 | 2.20 |
| 6 | 768 | 225.1 | 110.7 | 23.4 | 5.4 | 2.63 |
| 7 | 896 | 232.1 | 103.6 | 18.8 | 7.9 | 3.32 |
| 8 | 1024 | 196.8 | 117.0 | 25.7 | 7.7 | 5.05 |
| 9 | 1152 | 226.5 | 126.1 | 33.6 | 8.8 | 5.71 |
| 10 | 1280 | 233.7 | 123.9 | 36.5 | 9.1 | 4.33 |
| 12 | 1536 | 228.3 | 142.4 | 41.6 | 11.3 | 7.64 |
| 14 | 1792 | 233.9 | 149.2 | 45.2 | 12.9 | 8.77 |
| 16 | 2048 | 245.3 | 140.0 | 40.6 | 15.1 | 9.82 |
| 18 | 2304 | 262.9 | 167.5 | 48.3 | 23.0 | 10.43 |
| 20 | 2560 | 244.9 | 167.2 | 50.4 | 18.3 | 8.79 |
| 22 | 2816 | 231.5 | 158.9 | 55.0 | 21.3 | 9.04 |
| 24 | 3072 | 241.7 | 181.9 | 64.1 | 28.8 | 10.14 |
| 28 | 3584 | | 167.4 | 60.9 | 24.8 | 16.34 |
| 32 | 4096 | | 109.8 | 81.7 | 36.1 | 17.90 |
| 40 | 5120 | | | 72.8 | 32.8 | 22.22 |
| 48 | 6144 | | | 96.1 | 35.7 | 26.11 |
| 56 | 7168 | | | 101.0 | 41.6 | 23.38 |
| 64 | 8192 | | | 106.7 | 49.4 | 26.66 |
| 72 | 9216 | | | 108.8 | 50.3 | 32.36 |
| 80 | 10240 | | | 112.9 | 41.6 | 30.73 |
| 88 | 11264 | | | 119.5 | 42.3 | 33.86 |
| 96 | 12288 | | | 93.1 | 51.4 | 33.45 |
| 104 | 13312 | | | 99.0 | 51.3 | 33.84 |
| 112 | 14336 | | | 122.8 | 58.2 | 33.84 |
| 120 | 15360 | | | 115.3 | 52.1 | 32.77 |
| 136 | 17408 | | | 111.0 | 68.7 | 37.97 |
| 152 | 19456 | | | 139.8 | 67.9 | 42.25 |
| 168 | 21504 | | | 117.2 | 67.2 | 47.59 |
| 184 | 23552 | | | 96.8 | 76.0 | 38.84 |
| 200 | 25600 | | | | 66.6 | 40.73 |
| 240 | 30720 | | | | 68.9 | 45.49 |
| 280 | 35840 | | | | 81.3 | 44.67 |
| 320 | 40960 | | | | 65.6 | 52.17 |
| 360 | 46080 | | | | 73.5 | 48.89 |
| 400 | 51200 | | | | 62.4 | 44.51 |
| 440 | 56320 | | | | | 45.56 |
| 480 | 61440 | | | | | 43.87 |
| 520 | 66560 | | | | | 45.37 |
| 560 | 71680 | | | | | 43.54 |
| 600 | 76800 | | | | | 45.64 |
| 640 | 81920 | | | | | 39.98 |
| 680 | 87040 | | | | | 35.93 |