HIGH-PERFORMANCE COMPUTING WITH NVIDIA TESLA GPUS
Chris Butler, NVIDIA

Science is Desperate for Throughput
[Chart: computational demand of molecular dynamics simulations, 1982-2012+, on a scale from 1 Gigaflop to 1 Exaflop]
- BPTI, 3K atoms (1982)
- Estrogen receptor, 36K atoms (1997)
- F1-ATPase, 327K atoms (2003) - ran for 8 months to simulate 2 nanoseconds
- Ribosome, 2.7M atoms (2006)
- Chromatophore, 50M atoms (~1 Petaflop)
- Bacteria, 100s of chromatophores (~1 Exaflop)

Power Crisis in Supercomputing
[Chart: household power equivalent of sustained compute, 1982-2020]
- Gigaflop "block": 60,000 Watts (1982)
- Teraflop "neighborhood": 850,000 Watts (1996)
- Petaflop "town": 7,000,000 Watts (2008 - Jaguar, Los Alamos)
- Exaflop "city": 25,000,000 Watts (projected 2020)

"Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is expected to be 10 times more powerful than today's fastest supercomputer. Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 Petaflops, we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range." - September 30, 2009

What is GPU Computing?
GPU computing is heterogeneous computing with CPU + GPU: an x86 CPU and a GPU connected by the PCIe bus.

Low Latency or High Throughput?
CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution; much of the die is spent on control logic and caches rather than ALUs.
GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.
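The CPU + GPU model above maps directly onto CUDA C: the host (CPU) owns its own memory, data moves across the PCIe bus with explicit copies, and data-parallel work runs on the GPU as a kernel. A minimal sketch, assuming an illustrative kernel and array size that are not from the slides:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data-parallel work runs on the GPU: one thread per element.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                      // 1M elements (illustrative size)
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);     // host (CPU) memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);                 // device (GPU) memory

    // Traffic over the PCIe bus: host -> device, compute on GPU, device -> host.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);      // expect 2.0
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

The division of labour follows the "low latency vs. high throughput" slide: the serial setup and control stay on the CPU, while the per-element work is expressed as thousands of GPU threads.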

NVIDIA GPU Computing Ecosystem
CUDA training companies, CUDA development specialists, ISVs, TPP/OEMs, VARs and hardware architects build on the GPU architecture, the CUDA SDK & tools and NVIDIA hardware solutions to turn customer requirements into customer applications, from hardware architecture through to deployment.

NVIDIA GPU Product Families
- GeForce: entertainment
- Quadro: design & creation
- Tesla: high-performance computing

Many-Core High Performance Computing
NVIDIA's 10-series GPU has 240 cores; each core has a floating-point/integer unit, a logic unit, a move/compare unit and a branch unit.
- 1.4 billion transistors
- 1 Teraflop of processing power
- 240 processing cores
- Cores managed by a hardware thread manager that can spawn and manage 30,000+ threads (see the launch sketch below)
- Zero-overhead thread switching
NVIDIA's 2nd-generation CUDA processor.
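To make the "30,000+ threads" point concrete, here is a minimal sketch of how a single CUDA launch creates that many hardware-scheduled threads; the grid and block sizes are illustrative choices, not figures from the slides:

```cuda
#include <cuda_runtime.h>

// Trivial kernel: each thread handles exactly one element.
__global__ void initKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i;
}

int main() {
    const int n = 120 * 256;          // 30,720 elements
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // 120 blocks x 256 threads = 30,720 threads in one launch.
    // The hardware thread manager schedules them across the GPU's cores
    // with zero-overhead switching; the programmer only picks the shape.
    initKernel<<<120, 256>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```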

Tesla GPU Computing Products
Product | GPUs | Single precision | Double precision | Memory
SuperMicro 1U GPU SuperServer | 2 Tesla GPUs | 1.87 Teraflops | 156 Gigaflops | 8 GB (4 GB / GPU)
Tesla S1070 1U System | 4 Tesla GPUs | 4.14 Teraflops | 346 Gigaflops | 16 GB (4 GB / GPU)
Tesla C1060 Computing Board | 1 Tesla GPU | 933 Gigaflops | 78 Gigaflops | 4 GB
Tesla Personal Supercomputer | 4 Tesla GPUs | 3.7 Teraflops | 312 Gigaflops | 16 GB (4 GB / GPU)

Tesla C1060 Computing Processor
Processor: 1 Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision, 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: Full ATX, 4.736" x 10.5", dual-slot wide
System I/O: PCIe x16 Gen2
Typical power: 160 W

Tesla M1060 Embedded Module
OEM-only product, available as an integrated product in OEM systems.
Processor: 1 Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision, 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: Full ATX, 4.736" x 10.5", dual-slot wide
System I/O: PCIe x16 Gen2
Typical power: 160 W

Tesla Personal Supercomputer
- Supercomputing performance: massively parallel CUDA architecture, 960 cores, 4 Teraflops, 250x the performance of a desktop
- Personal: one researcher, one supercomputer; plugs into a standard power strip
- Accessible: program in C for Windows or Linux; available now worldwide for under $10,000

Tesla S1070 1U System
Processors: 4 Tesla T10
Number of cores: 960
Core clock: 1.44 GHz
Performance: 4 Teraflops
Total system memory: 16.0 GB (4.0 GB per T10)
Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
Form factor: 1U (EIA 19" rack)
System I/O: 2x PCIe x16 Gen2
Typical power: 700 W

SuperMicro GPU 1U SuperServer
- Two M1060 GPUs in a 1U
- Dual Nehalem-EP Xeon CPUs
- Up to 96 GB DDR3 ECC
- Onboard InfiniBand (QDR)
- 3 hot-swap 3.5" SATA HDDs
- 1200 W power supply

CUDA Parallel Computing Architecture
GPU computing applications are written in CUDA C, OpenCL, DirectCompute, CUDA Fortran, Java and Python, all running on NVIDIA GPUs with the CUDA parallel computing architecture.
(OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.)

NVIDIA CUDA C and OpenCL
- CUDA C: entry point for developers who prefer high-level C
- OpenCL: entry point for developers who want a low-level API
- Shared back-end compiler and optimization technology: both compile to PTX, which runs on the GPU
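As an illustration of that shared back end, a CUDA C kernel can be compiled straight to PTX with nvcc. The kernel below and the file names are placeholders for this sketch, not material from the slides:

```cuda
// saxpy.cu -- illustrative kernel; compile it to PTX with:
//   nvcc -ptx saxpy.cu -o saxpy.ptx
// The PTX virtual ISA is the common target regardless of whether the
// source language is CUDA C, OpenCL, CUDA Fortran, etc.

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```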

CUDA Zone: www.nvidia.com/CUDA
Resources for CUDA developers:
- CUDA Toolkit: compiler, libraries
- CUDA SDK: code samples
- CUDA Profiler
- Forums

Wide Developer Acceptance and Success
- 146X  Interactive visualization of volumetric white matter connectivity
- 36X   Ion placement for molecular dynamics simulation
- 149X  Financial simulation of LIBOR model with swaptions
- 47X   [email protected]: an M-script API for linear algebra operations on GPU
- 17X   Simulation in Matlab using .mex file CUDA function
- 100X  Astrophysics N-body simulation
- 20X   Ultrasound medical imaging for cancer diagnostics
- 24X   Highly optimized object-oriented molecular dynamics
- 30X   Cmatch exact string matching to find similar proteins and gene sequences
- 19X   Transcoding HD video stream to H.264

CUDA Co-Processing Ecosystem
- Over 200 universities teaching CUDA: UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU
- Applications: oil & gas, finance, CFD, medical, biophysics, imaging, numerics, DSP, EDA
- Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision
- Languages and compilers: C, C++, DirectX, Fortran, Java, OpenCL, Python, PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP
- Consultants: ANEO, GPU Tech
- OEMs

NEXT-GENERATION GPU ARCHITECTURE: FERMI

Introducing the Fermi Architecture
The soul of a supercomputer in the body of a GPU:
- 3 billion transistors
- Over 2x the cores (512 total)
- 8x the peak double-precision performance
- L1 and L2 caches
- ECC
- ~2x the memory bandwidth (GDDR5)
- Up to 1 Terabyte of GPU memory
- Concurrent kernels
- Hardware support for C++

Design Goal of Fermi
Bring more users and more applications to the GPU: expand the performance sweet spot of the GPU from purely data-parallel workloads toward the instruction-parallel, many-decision workloads traditionally served by the CPU, while operating on large data sets.

Streaming Multiprocessor Architecture
- 32 CUDA cores per SM (512 total)
- 8x peak double-precision floating-point performance; double precision runs at 50% of peak single precision
- Dual thread scheduler (two schedulers and two dispatch units per SM)
- 64 KB of RAM per SM, configurable between shared memory and L1 cache
- 16 load/store units and 4 special function units per SM
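A sketch of how the configurable 64 KB shared-memory / L1 split is used in practice, using the CUDA runtime's cudaFuncSetCacheConfig call; the block-level reduction kernel itself is illustrative rather than taken from the slides:

```cuda
#include <cuda_runtime.h>

// Block-level sum reduction staged through on-chip shared memory.
// Launch with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                 // lives in the SM's 64 KB on-chip RAM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

void configure() {
    // Ask for the larger shared-memory split (48 KB shared / 16 KB L1 on Fermi);
    // cudaFuncCachePreferL1 would favor the cache side instead.
    cudaFuncSetCacheConfig(blockSum, cudaFuncCachePreferShared);
}
```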

CUDA Core Architecture
- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Fused multiply-add (FMA) instruction for both single and double precision
- Newly designed integer ALU optimized for 64-bit and extended-precision operations
(Each CUDA core contains a dispatch port, operand collector, FP unit, INT unit and result queue.)
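To illustrate the FMA point, here is a small sketch using the standard fma() device function; the kernel itself is illustrative, but fma(), __fma_rn() and __fmaf_rn() are standard CUDA math calls:

```cuda
// Fused multiply-add computes a*b + c with a single rounding step,
// unlike an unfused multiply followed by an add (two roundings).
__global__ void axpyFMA(const double *a, const double *b,
                        const double *c, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = fma(a[i], b[i], c[i]);   // double-precision FMA
        // The explicit round-to-nearest intrinsic is __fma_rn(a[i], b[i], c[i]);
        // __fmaf_rn() is the single-precision equivalent.
    }
}
```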

Cached Memory Hierarchy
Fermi is the first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory.
- L1 cache per SM (32 cores): improves bandwidth and reduces latency

Parallel DataCache Memory Hierarchy
- Unified L2 cache (768 KB)
- Fast, coherent data sharing across all cores in the GPU
- Operate on large data sets
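One concrete beneficiary of a coherent, unified L2 is cross-block communication through global-memory atomics. A minimal sketch (the histogram kernel and bin count are illustrative assumptions, not from the slides):

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 64   // illustrative bin count

// Every block updates the same global histogram. atomicAdd on global
// memory is serviced through the unified L2, which keeps the counters
// coherent across all SMs without a round trip to DRAM for each update.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}
```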

Larger, Faster Memory Interface
- GDDR5 memory interface, 2x the speed of GDDR3
- Up to 1 Terabyte of memory attached to the GPU

Error Correcting Code
- ECC protection for DRAM: ECC supported for GDDR5 memory
- All major internal memories are ECC protected: register file, L1 cache, L2 cache

GigaThread Hardware Thread Scheduler
- Concurrent kernel execution: independent kernels from the same application can run in parallel instead of serially
- Faster context switching

Enhanced Software Support
- Full C++ support: virtual functions, try/catch hardware support
- System call support: pipes, semaphores, printf, etc.
- Unified 64-bit memory addressing
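A sketch of how concurrent kernel execution is expressed from the host side using CUDA streams; the kernels and sizes are illustrative, and on Fermi-class or newer hardware the two launches may overlap rather than run back to back:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main() {
    const int n = 1 << 18;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // Independent work placed in separate streams may execute concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaDeviceSynchronize();    // wait for both streams to finish

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```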

Introducing Tesla Bio WorkBench
- Applications: TeraChem, LAMMPS, GPU-AutoDock, MUMmerGPU
- Community: downloads, documentation, technical papers, discussion forums, benchmarks & configurations
- Platforms: Tesla Personal Supercomputer, Tesla GPU clusters

Tesla Bio Workbench Applications
- Molecular dynamics: AMBER, ACEMD, GROMACS, GROMOS, LAMMPS, NAMD
- Quantum chemistry: TeraChem
- Visualization (MD & QC): VMD
- Docking: GPU AutoDock
- Sequence analysis: CUDASW++ (Smith-Waterman), MUMmerGPU, GPU-HMMER, CUDA-MEME motif discovery

AMBER Molecular Dynamics
- Alpha (now): Generalized Born
- Beta release (Q1 2010): PME (Particle Mesh Ewald)
- Beta 2 release (Q2 2010): multi-GPU + MPI support

Implicit solvent (Generalized Born) results: 1 Tesla GPU is about 8x faster than 2 quad-core CPUs.
[Chart: Generalized Born simulations, ns/day]
- Myoglobin (2,492 atoms): 4.04 ns/day on 2x Intel quad-core E5462 2.8 GHz vs 27.72 ns/day on a Tesla C1060 GPU (~7x)
- Nucleosome (25,095 atoms): 0.06 ns/day on 2x Intel quad-core E5462 2.8 GHz vs 0.52 ns/day on a Tesla C1060 GPU (~8.6x)
Data courtesy of the San Diego Supercomputing Center.
More info: http://www.nvidia.com/object/amber_on_tesla.html

ISV Status
Developer | Application | Description | Category | Status
ORNL | HOMME | High Order Method Modeling Environment | Government | Work in progress
Ames Lab | HOOMD | Molecular dynamics | Life Science | Available
NCI | AutoDock | Molecular dynamics | Life Science | Selected support from third parties; also an open-source project; need to reach out to them more
ORNL | MADNESS | Computational chemistry | Life Science | Work in progress
 | Smith-Waterman | DNA sequencing | Life Science | Various versions available, such as http://gpu.epfl.ch/sw.html
Stanford | OpenMM | Molecular library | Life Science | Available, with PME in beta
Gaussian | Gaussian | Quantum chemistry | Life Science | Taking names, no date given
Harvard and Univ of Delaware | CHARMM | Molecular dynamics | Life Science | CUDA support in library as part of alpha build; working with them to get a beta in February
Howard Hughes Med | HMMER | Hidden Markov models for bio | Life Science | Available
IA State | GAMESS | Quantum chemistry | Life Science | Integration in progress, hopeful of beta in Q1
Scripps | LAMMPS | Molecular dynamics | Life Science | Selected algorithms supported; project ongoing; download available
Scripps | AMBER | Molecular dynamics | Life Science | AMBER 10 patch for GB today, PME by Dec/Jan
Scripps | AutoDock | Protein docking | Life Science | Available via third party
Stockholm Center | GROMACS | Molecular dynamics | Life Science | Available with PME via OpenMM
UIUC | NAMD | Molecular dynamics | Life Science | NAMD 2.7 b2 available
UIUC | VMD | Visualization of MD | Life Science | Available
Univ of Delaware | [email protected] | Protein docking | Life Science | Not sure
Univ of Maryland | MUMmerGPU | DNA sequence alignment | Life Science | Available
Allinea | Allinea DDT | Linux debugger | Tools - Debug/Profile | Beta this month
TotalView | TotalView Debugger | Linux debugger | Tools - Debug/Profile | Beta in Q1

GPU Revolutionizing Computing
A 2015 GPU*:
- ~20x the performance of today's GPU
- ~5,000 cores at ~3 GHz (50 mW each)
- ~20 TFLOPS
- ~1.2 TB/s of memory bandwidth
[Chart: GFlops over time - T8 (128 cores, 2006), T10 (240 cores, 2008), Fermi (512 cores), projected 2015 GPU]
* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.
