Terascaling Applications on HPCx: The First 12 Months
Mike Ashworth, HPCx Terascaling Team
HPCx Service, CCLRC Daresbury Laboratory, UK
[email protected]
http://www.hpcx.ac.uk/
HPCx Annual Seminar, 10th December 2003

Outline
- Terascaling Objectives
- Case Studies: DL_POLY, CRYSTAL, CASTEP, AMBER, PFARM, PCHAN, POLCOMS
- Application-driven, not H/W-driven
- Efficiency of Codes
- Summary

Terascaling Objectives
- The primary aim of the HPCx service is Capability Computing: jobs which use >= 50% of the CPUs
- Key objective: user codes should scale to O(1000) CPUs
- The largest part of our science support is the Terascaling Team
  - Understanding performance and scaling of key codes
  - Enabling world-leading calculations (demonstrators)
- Closely linked with the Software Engineering Team and Applications Support Team

Strategy for Capability Computing
- Performance attributes of key applications: trouble-shooting with Vampir and Paraver
- Scalability of numerical algorithms: parallel eigensolvers, FFTs, etc.
- Optimisation of communication collectives, e.g. MPI_ALLTOALLV in CASTEP
- New techniques: mixed-mode programming

- Memory-driven approaches, e.g. in-core SCF and DFT, direct minimisation in CRYSTAL
- Migration from replicated to distributed data, e.g. DL_POLY3
- Scientific drivers amenable to Capability Computing: enhanced sampling methods, replica methods

Case Studies

Molecular Simulation: DL_POLY
- W. Smith and T.R. Forester, CLRC Daresbury Laboratory
- General-purpose molecular dynamics simulation package
- http://www.cse.clrc.ac.uk/msi/software/DL_POLY/

DL_POLY3 Coulomb Energy Performance
- Distributed-data SPME, with revised FFT scheme
- Performance relative to the Cray T3E/1200E

[Chart: DL_POLY3, 216,000 ions, 200 time steps, cutoff = 12; performance relative to the Cray T3E/1200E vs number of CPUs (32-256) for IBM SP/Regatta-H, AlphaServer SC ES45/1000 and SGI Origin 3800/R14k-500]
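The revised FFT scheme targets the dominant communication in SPME: a parallel 3D FFT is built from rounds of independent 1D FFTs along whichever axis each processor holds locally, with a global transpose (an MPI all-to-all in practice) between rounds. A minimal serial numpy sketch of that dataflow, illustrative only and not DL_POLY's actual implementation:

```python
import numpy as np

def fft3d_by_transposes(a):
    """3D FFT as three rounds of 1D FFTs separated by transposes.

    In a slab-decomposed parallel code each transpose is a global
    all-to-all; here the same dataflow is emulated serially.
    """
    a = np.fft.fft(a, axis=2)    # 1D FFTs along the locally held axis
    a = a.transpose(2, 0, 1)     # "all-to-all": make the next axis local
    a = np.fft.fft(a, axis=2)
    a = a.transpose(2, 0, 1)
    a = np.fft.fft(a, axis=2)
    a = a.transpose(2, 0, 1)     # third transpose restores the layout
    return a

rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 8))
assert np.allclose(fft3d_by_transposes(grid), np.fft.fftn(grid))
```

At scale it is the transposes, not the 1D FFTs, that dominate, which is why reworking the FFT's communication pattern pays off.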

DL_POLY3 Macromolecular Simulations
- Gramicidin in water: rigid bonds + SHAKE; 792,960 ions, 50 time steps
[Charts: measured time (seconds) and performance relative to the SGI Origin 3800/R14k-500, both vs number of CPUs (32-256), for SGI Origin 3800/R14k-500, AlphaServer SC ES45/1000 and IBM SP/Regatta-H]

Materials Science: CRYSTAL
- Calculates wave-functions and properties of crystalline systems
- Periodic Hartree-Fock or density-functional Kohn-Sham Hamiltonian
- Various hybrid approximations

- http://www.cse.clrc.ac.uk/cmg/CRYSTAL/

CRYSTAL
- Electronic structure and related properties of periodic systems
- All-electron, local Gaussian basis set, DFT and Hartree-Fock
- Under continuous development since 1974
- Distributed to over 500 sites worldwide
- Developed jointly by Daresbury and the University of Turin

CRYSTAL Functionality

- Basis set: LCAO (Gaussians); all-electron or pseudopotential
- Hamiltonian: Hartree-Fock (UHF, RHF); DFT (LSDA, GGA); hybrid functionals (B3LYP)
- Techniques: replicated-data parallel; distributed-data parallel; forces; structural optimization; direct SCF
- Properties: energy; structure; vibrations (phonons); elastic tensor; ferroelectric polarisation; piezoelectric constants; X-ray structure factors; density of states / bands; charge/spin densities; magnetic coupling; electrostatics (V, E, EFG classical); Fermi contact (NMR)

- EMD (Compton, e-2e)
- Visualisation: AVS GUI (DLV)

Benchmark Runs on Crambin
- Very small protein from Crambe abyssinica: 1284 atoms per unit cell
- Initial studies used STO-3G (3948 basis functions), improved to 6-31G** (12,354 functions)
- All calculations Hartree-Fock
- As far as we know, the largest Hartree-Fock calculation ever converged

Scalability of CRYSTAL for Crystalline Crambin
[Chart: performance (arbitrary units) vs number of processors: ideal scaling plus 6-31G**, 6-31G and STO-3G on the IBM p690, and 6-31G and STO-3G on the SGI Origin]
- HPCx vs. SGI Origin
- A faster, more stable version of the parallel Jacobi diagonalizer replaces ScaLAPACK

- Increasing the basis set size increases the scalability

Crambin Results: Electrostatic Potential
- Charge-density isosurface coloured according to potential
- Useful to determine possible chemically active groups

Futures: Rusticyanin
- Rusticyanin (Thiobacillus ferrooxidans) has 6284 atoms (Crambin had 1284) and is involved in redox processes
- We have just started calculations using over 33,000 basis functions
- In collaboration with S. Hasnain (DL) we want to calculate redox potentials for rusticyanin and associated mutants

Materials Science: CASTEP
- CAmbridge Serial Total Energy Package
- http://www.cse.clrc.ac.uk/cmg/NETWORKS/UKCP/

What is CASTEP?
- First-principles (DFT) materials simulation code: electronic energy, geometry optimization, surface interactions, vibrational spectra, materials under pressure, chemical reactions, molecular dynamics
- Method: direct minimization, plane-wave expansion of valence electrons, pseudopotentials for core electrons

CASTEP 2003 HPCx Performance Gain
[Chart: job time (seconds) vs total number of processors (80-320) for an Al2O3 120-atom cell with 5 k-points, Jan-03 code vs current 'best']
- Bottleneck: data traffic in the 3D FFT and MPI_ALLTOALLV
[Chart: job time (seconds) vs total number of processors (128-512) for an Al2O3 270-atom cell with 2 k-points, Jan-03 code vs current 'best']

Molecular Simulation: AMBER
- AMBER: Assisted Model Building with Energy Refinement
- Weiner and Kollman, University of California, 1981
- Widely used suite of programs, particularly for biomolecules
- http://amber.scripps.edu/

AMBER: Initial Scaling
[Chart: speed-up (axis 0-12) vs number of processors (0-128)]
- Factor IX protein with Ca++ ions, 90,906 atoms
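A speed-up curve that flattens well below the processor count is the classic signature of a residual serial fraction, which Amdahl's law makes quantitative. The serial fraction below is illustrative only, not a fit to the measured Factor IX data:

```python
def amdahl_speedup(p, f):
    """Amdahl's law: speed-up on p processors with serial fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

# With an 8% serial fraction the speed-up can never exceed 1/f = 12.5,
# however many processors are used:
s128 = amdahl_speedup(128, 0.08)
assert 11.0 < s128 < 12.0
assert abs(amdahl_speedup(10**9, 0.08) - 12.5) < 1e-3
```

This is why the PMEMD work described next attacks the remaining communication steps rather than adding processors.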

Current Developments: AMBER
- Bob Duke has developed a new version of Sander on HPCx
- Originally called AMD (Amber Molecular Dynamics), renamed PMEMD (Particle Mesh Ewald Molecular Dynamics)
- Substantial rewrite of the code: converted to Fortran90, removed multiple copies of routines
- Likely to be incorporated into AMBER8
- We are looking at optimising the collective communications, in particular the reduction/scatter
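The reduction/scatter arises because, typically, each processor accumulates partial forces for many atoms but afterwards needs only the totals for its own atoms; fusing the two steps into a single MPI_Reduce_scatter avoids moving the whole array twice. A serial numpy emulation of the semantics (the per-rank arrays and counts here are hypothetical, not PMEMD's data):

```python
import numpy as np

def reduce_scatter(per_rank_data, counts):
    """Emulate MPI_Reduce_scatter: elementwise sum across ranks,
    then give each rank only its own slice of the result."""
    total = np.sum(per_rank_data, axis=0)             # the "reduce" step
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
    return [total[o:o + c] for o, c in zip(offsets, counts)]  # the "scatter"

# Four hypothetical ranks, each contributing partial forces on 8 atoms:
forces = [np.full(8, rank + 1.0) for rank in range(4)]
chunks = reduce_scatter(forces, [2, 2, 2, 2])
# Every element of the reduced array is 1 + 2 + 3 + 4 = 10,
# and each rank keeps a slice of length 2:
assert all(np.allclose(c, 10.0) and c.size == 2 for c in chunks)
```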

Optimisation: PMEMD
[Chart: time (seconds) vs number of processors (0-256) for PMEMD and Sander7]

Atomic and Molecular Physics

PFARM
- Queen's University Belfast, CLRC Daresbury Laboratory
- R-matrix formalism to treat applications such as the description of the edge region in tokamak plasmas (fusion power research) and the interpretation of astrophysical spectra

Peigs vs. ScaLAPACK in PFARM
- Bottleneck: matrix diagonalisation
[Chart: time (secs) vs number of processors (0-256) for Peigs total, ScaLAPACK total, Peigs diag and ScaLAPACK diag]

ScaLAPACK Diagonalisation on HPCx
[Chart: time (secs) vs number of processors (0-256) for PDSYEV and PDSYEVD at matrix dimensions 7194 and 3888]
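PDSYEV and PDSYEVD solve the same problem, all eigenvalues and eigenvectors of a real symmetric matrix, by QR iteration and divide-and-conquer respectively, which is why their curves are directly comparable above. A serial sketch of the problem itself using LAPACK through numpy (the matrix size is illustrative, not the benchmark's 7194 or 3888):

```python
import numpy as np

# Build a small real symmetric test matrix, a stand-in for a sector
# Hamiltonian block:
n = 200
rng = np.random.default_rng(1)
a = rng.standard_normal((n, n))
h = (a + a.T) / 2.0

# numpy's eigh calls LAPACK's symmetric eigensolver, the serial
# counterpart of the ScaLAPACK PDSYEV/PDSYEVD routines:
w, v = np.linalg.eigh(h)

# Check the spectral decomposition H = V diag(w) V^T:
assert np.allclose(v @ np.diag(w) @ v.T, h)
assert np.all(np.diff(w) >= 0)   # eigenvalues returned in ascending order
```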

Stage 1 (Sector Diags) on HPCx
- Sector Hamiltonian matrix size 10,032 (x 3 sectors)
[Chart: time (secs) vs number of processors (32-256) for Peigs, ScaLAPACK, divide-and-conquer and projected ScaLAPACK]

Computational Engineering: UK Turbulence Consortium
- Led by Prof. Neil Sandham, University of Southampton
- Focus on compute-intensive methods (Direct Numerical Simulation, Large Eddy Simulation, etc.) for the simulation of turbulent flows
- Shock/boundary-layer interaction modelling: critical for accurate aerodynamic design but still poorly understood
- http://www.afm.ses.soton.ac.uk/

Direct Numerical Simulation: 360³ Benchmark

[Chart: performance (million iteration points/sec) vs number of processors (0-1024) for IBM Regatta (HPCx), IBM Regatta (ORNL) and Cray T3E/1200E, scaled from 128 CPUs]

Environmental Science: POLCOMS
- Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS)
- Coupled marine ecosystem modelling
- http://www.pol.ac.uk/home/research/polcoms/

Coupled Marine Ecosystem Model
[Diagram: physical model coupled to pelagic ecosystem and benthic models; forcings include irradiation, heat flux, cloud cover, wind stress, river inputs and an open boundary; exchanges include temperature (°C), nutrients (C, N, P, Si) and sediments]

POLCOMS Resolution Benchmark: HPCx

[Chart: performance (M grid-points-timesteps/sec) vs number of processors (0-1024) on the IBM at 1, 2, 3, 6 and 12 km resolutions, against ideal IBM scaling]

POLCOMS 2 km Benchmark: All Systems
[Chart: performance (M grid-points-timesteps/sec) vs number of processors (0-1024) for IBM p690, Cray T3E and Origin 3800, against ideal IBM scaling]
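The metric in these charts, M grid-points-timesteps/sec, normalises elapsed time by problem size so that runs at different resolutions can share one axis. A sketch of the calculation (the grid dimensions, step count and timing below are hypothetical, not the benchmark's actual figures):

```python
def grid_point_rate(nx, ny, nz, timesteps, elapsed_seconds):
    """Millions of grid-point-timesteps processed per second."""
    return nx * ny * nz * timesteps / elapsed_seconds / 1e6

# A hypothetical 200 x 200 x 34 grid advanced 100 steps in 170 s:
rate = grid_point_rate(200, 200, 34, 100, 170.0)
assert abs(rate - 0.8) < 1e-12
```

Because the work per timestep is proportional to the grid size, a code with perfect scaling traces the same line at every resolution on this axis.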

Efficiency of Codes

Motivation and Strategy
- Scalability of Terascale applications is only half the story: scientific output also depends on absolute performance, and single-CPU performance is the key
- Percentage of peak is seen as an important measure
- Comparison with other systems, e.g. vector machines
- Run representative test cases on small numbers of processors, for applications and some important kernels
- Use IBM's hpmlib to measure Mflop/s
- Other hpmlib counters help to understand performance: memory bandwidth, cache miss rates, FMA count, computational intensity, etc.

Matrix-Matrix Multiply Kernel

[Chart: % of peak (up to ~60%) vs number of processors (0-32)]

PCHAN Small Test Case: 120³

[Chart: % of peak (left axis, 0-10%) and memory bandwidth (MB/s, right axis, 0-1500) vs number of processors (0-128)]
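The percentage-of-peak figures above come from hpmlib's Mflop/s counts divided by the theoretical peak; assuming the phase-1 HPCx POWER4's peak of 1.3 GHz x 4 flops/cycle (two fused multiply-add units) = 5.2 Gflop/s per CPU, the same figure can be derived from a wall-clock timing for any kernel with a known flop count, such as a matrix-matrix multiply at about 2n³ flops. The timing below is hypothetical:

```python
def percent_of_peak(n, elapsed_seconds, peak_mflops=5200.0):
    """% of peak for an n x n matrix-matrix multiply (~2*n^3 flops),
    assuming a 5.2 Gflop/s per-CPU peak (1.3 GHz POWER4)."""
    mflops = 2.0 * n**3 / elapsed_seconds / 1e6
    return 100.0 * mflops / peak_mflops

# A hypothetical n = 1000 multiply completing in 0.77 s runs at about
# 2600 Mflop/s, i.e. roughly half of peak:
assert round(percent_of_peak(1000, 0.77)) == 50
```

The gap between the ~50% achievable by a tuned dgemm and the few percent seen for PCHAN is what the memory-bandwidth curve above helps to explain.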

Summary of Percentage of Peak
[Chart: % of peak (0-60) for MXM, DIAG, PRMAT, CASTEP, H2MOL, GAMESS, CRYSTAL, NAMD, POLCOMS, AMBER, PCHAN and DLPOLY]

Acknowledgements
- HPCx Terascaling Team: Mike Ashworth, Mark Bull, Ian Bush, Martyn Guest, Joachim Hein, David Henty, Adrian Jackson, Chris Johnson, Martin Plummer, Gavin Pringle, Lorna Smith, Kevin Stratford, Andrew Sunderland
- IBM Technical Support: Luigi Brochard et al.
- CSAR Computing Service: Cray T3E turing, Origin 3800 R12k-400 green
- ORNL: IBM Regatta cheetah
- SARA: Origin 3800 R14k-500
- PSC: AlphaServer SC ES45-1000

The Reality of Capability Computing on HPCx
- The success of the Terascaling strategy is shown by the Nov 2003 HPCx usage
[Pie chart: usage by job size in processors: 512+ (Capability) 48.0%, 256 21.4%, 128 15.8%, 64 7.3%, 32 5.1%, 16 1.9%, 8 0.2%]
- Capability jobs (512+ processors) account for 48% of usage; even without TeraGyroid it is 40.7%

Summary
- The HPCx Terascaling Team is addressing scalability for a wide range of codes
- Key strategic application areas: Atomic and Molecular Physics, Molecular Simulation, Materials Science, Computational Engineering, Environmental Science
- Reflected by take-up of Capability Computing on HPCx: in Nov 03, >40% of time was used by jobs with 512 processors and greater
- Key challenges:

  - Maintain progress with Terascaling
  - Include new applications and new science areas
  - Address efficiency issues, especially single-processor performance
  - Fully exploit the phase 2 system: 1.7 GHz p690+, 32-processor partitions, Federation interconnect