HIGH-PERFORMANCE COMPUTING WITH NVIDIA TESLA GPUS
Chris Butler, NVIDIA
© NVIDIA Corporation 2009

Science is Desperate for Throughput

[Chart: sustained Gigaflops (log scale, 1 to 1,000,000,000) required by molecular simulations, 1982-2012]
- 1982: BPTI, 3K atoms — ran for 8 months to simulate 2 nanoseconds
- 1997: Estrogen Receptor, 36K atoms
- 2003: F1-ATPase, 327K atoms
- 2006: Ribosome, 2.7M atoms
- 1 Petaflop: Chromatophore, 50M atoms
- 1 Exaflop: Bacteria, 100s of Chromatophores

Power Crisis in Supercomputing

[Chart: power draw vs performance with household equivalents, 1982-2020]
- Exaflop "City": 25,000,000 Watts
- Petaflop "Town": 7,000,000 Watts (Jaguar; Los Alamos)
- Teraflop "Neighborhood": 850,000 Watts
- Gigaflop "Block": 60,000 Watts

"Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is expected to be 10 times more powerful than today's fastest supercomputer. Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 Petaflops, we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range." — September 30, 2009

What is GPU Computing?

[Diagram: x86 CPU connected to the GPU over the PCIe bus]
- Computing with CPU + GPU
- Heterogeneous computing

Low Latency or High Throughput?

CPU
- Optimised for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU
- Optimised for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation

[Diagram: CPU die dominated by control logic and cache vs GPU die dominated by ALUs]
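The CPU + GPU split described above can be sketched in CUDA C. This is a minimal illustration, not code from the deck (array sizes and names are my own): serial setup runs on the x86 host, the data crosses the PCIe bus, and the data-parallel loop becomes a GPU kernel with one thread per element.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: runs on the GPU, one thread per array element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) side: allocate and fill the inputs serially.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Device (GPU) side: allocate, copy across the PCIe bus, launch, copy back.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The heterogeneous model keeps latency-sensitive control flow on the CPU and moves only the throughput-bound loop to the GPU.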
NVIDIA GPU Computing Ecosystem

[Diagram: partner ecosystem — CUDA Training Company, ISV, TPP / OEM, CUDA Development Specialist, Hardware Architect, VAR — linking Customer Requirements, Hardware Architecture, CUDA SDK & Tools, Customer Application, NVIDIA Hardware Solutions, and Deployment]

NVIDIA GPU Product Families
- GeForce: Entertainment
- Tesla(TM): High-Performance Computing
- Quadro: Design & Creation

Many-Core High Performance Computing
- NVIDIA's 10-series GPU has 240 cores
- Each core has a:
  - Floating point / integer unit
  - Logic unit
  - Move, compare unit
  - Branch unit
- 1.4 billion transistors
- 1 Teraflop of processing power
- 240 processing cores
- Cores managed by a thread manager
  - Thread manager can spawn and manage 30,000+ threads
  - Zero-overhead thread switching
- NVIDIA's 2nd Generation CUDA Processor

Tesla GPU Computing Products
- SuperMicro 1U GPU SuperServer: 2 Tesla GPUs
- Tesla S1070 1U System: 4 Tesla GPUs
- Tesla C1060 Computing Board: 1 Tesla GPU
- Tesla Personal Supercomputer: 4 Tesla GPUs
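The 30,000+ threads the thread manager handles are not created one by one. A single kernel launch specifies a grid of blocks and a block size, and the hardware schedules the resulting threads with zero-overhead switching. A small sketch (kernel and names are my own, not from the deck):

```cuda
__global__ void scale(float *x, float s, int n) {
    // Each thread derives a globally unique index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// One host-side launch creates 128 blocks x 256 threads = 32,768 threads,
// on the order of the 30,000+ the thread manager keeps in flight to hide
// memory latency (d_x is assumed to be a device array of 32,768 floats):
//
//   scale<<<128, 256>>>(d_x, 2.0f, 32768);
```

Oversubscribing the cores with far more threads than can run at once is exactly what makes the latency-tolerant architecture described earlier work.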
Tesla C1060 Computing Board
- Processor: 1 Tesla T10
- Number of cores: 240
- Core clock: 1.296 GHz
- Floating point performance: 933 Gflops single precision, 78 Gflops double precision
- On-board memory: 4.0 GB
- Memory bandwidth: 102 GB/sec peak
- Memory I/O: 512-bit, 800 MHz GDDR3
- Form factor: Full ATX, 4.736" x 10.5", dual slot wide
- System I/O: PCIe x16 Gen2
- Typical power: 160 W

Tesla M1060 Embedded Module
- Processor: 1 Tesla T10
- Number of cores: 240
- Core clock: 1.296 GHz
- Floating point performance: 933 Gflops single precision, 78 Gflops double precision
- On-board memory: 4.0 GB
- Memory bandwidth: 102 GB/sec peak
- Memory I/O: 512-bit, 800 MHz GDDR3
- Form factor: Full ATX, 4.736" x 10.5", dual slot wide
- System I/O: PCIe x16 Gen2
- Typical power: 160 W
- OEM-only product; available as an integrated product in OEM systems

Tesla Personal Supercomputer
- Supercomputing performance: massively parallel CUDA architecture, 960 cores, 4 Teraflops — 250x the performance of a desktop
- Personal: one researcher, one supercomputer; plugs into a standard power strip
- Accessible: program in C for Windows, Linux; available now worldwide under $10,000
Tesla S1070 1U System
- Processors: 4 Tesla T10
- Number of cores: 960
- Core clock: 1.44 GHz
- Performance: 4 Teraflops
- Total system memory: 16.0 GB (4.0 GB per T10)
- Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
- Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
- Form factor: 1U (EIA 19" rack)
- System I/O: 2x PCIe x16 Gen2
- Typical power: 700 W

SuperMicro GPU 1U SuperServer
- Two M1060 GPUs in a 1U
- Dual Nehalem-EP Xeon CPUs
- Up to 96 GB DDR3 ECC
- Onboard InfiniBand (QDR)
- 3 hot-swap 3.5" SATA HDDs
- 1200 W power supply

CUDA Parallel Computing Architecture
- GPU computing applications
- CUDA C, OpenCL, DirectCompute, CUDA Fortran, Java and Python
- NVIDIA GPU with the CUDA Parallel Computing Architecture

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

NVIDIA CUDA C and OpenCL
- CUDA C: entry point for developers who prefer high-level C
- OpenCL: entry point for developers who want a low-level API
- Shared back-end compiler and optimization technology; both front ends compile to PTX for the GPU
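Both front ends feed the same back end through PTX, NVIDIA's virtual instruction set. As a small illustration (the kernel and file name are my own, not from the deck), nvcc can emit the PTX for a CUDA C kernel directly:

```cuda
// saxpy.cu — compile with:  nvcc -ptx saxpy.cu
// This produces saxpy.ptx, the intermediate form that the shared
// back-end compiler and optimizer consume for both CUDA C and OpenCL.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```

Because the optimization work happens below PTX, improvements there benefit every language that targets the architecture.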
CUDA Zone: www.nvidia.com/CUDA — Resources for CUDA developers
- CUDA Toolkit: compiler, libraries
- CUDA SDK: code samples
- CUDA Profiler
- Forums

Wide Developer Acceptance and Success
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ion placement for molecular dynamics simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: [email protected]: An M-script API for linear algebra operations on GPU
- 17X: Simulation in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 19X: Transcoding HD video stream to H.264
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences

CUDA Co-Processing Ecosystem
- Over 200 universities teaching CUDA: UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU
- Applications: Oil & Gas, Finance, CFD, Medical, Biophysics, Imaging, Numerics, DSP, EDA
- Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision
- Languages & compilers: C, C++, DirectX, Fortran, Java, OpenCL, Python, PGI Fortran, CAPS HMPP, MCUDA
- MPI, NOAA Fortran2C, OpenMP
- Consultants: ANEO, GPU Tech
- OEMs

NEXT-GENERATION GPU ARCHITECTURE: FERMI

Introducing the Fermi Architecture
The soul of a supercomputer in the body of a GPU
- 3 billion transistors
- Over 2x the cores (512 total)
- 8x the peak double precision performance
- L1 and L2 caches
- ECC
- ~2x memory bandwidth (GDDR5)
- Up to 1 Terabyte of GPU memory
- Concurrent kernels
- Hardware support for C++

[Diagram: Fermi die — GigaThread scheduler and Host I/F feeding a unified L2 surrounded by DRAM interfaces]

Design Goal of Fermi
- Bring more users, more applications to the GPU
- Expand the performance sweet spot of the GPU
[Diagram: GPU covers data-parallel work on large data sets; CPU covers instruction-parallel work with many decisions]

Streaming Multiprocessor Architecture
- 32 CUDA cores per SM (512 total)
- 8x peak double precision floating point performance; double precision runs at 50% of peak single precision
- Dual thread scheduler
- 64 KB of RAM for shared memory and L1 cache (configurable)
[Diagram: SM — instruction cache, dual scheduler/dispatch, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, uniform cache]

CUDA Core Architecture
- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Fused multiply-add (FMA) instruction for both single and double precision
- Newly designed integer ALU optimized for 64-bit and extended precision operations
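The configurable 64 KB shared-memory/L1 array above is something the programmer can actually steer. A hedged sketch (kernel is my own, not from the deck): a block-level reduction that stages data in on-chip shared memory, with the cache-preference hint a shared-memory-heavy kernel can request on Fermi. The FMA instruction mentioned above is similarly exposed in device code via fmaf()/fma(), which compile to a single fused multiply-add with one rounding.

```cuda
#include <cuda_runtime.h>

#define TILE 256

// Block-level sum reduction staged through the SM's on-chip shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];          // lives in the 64 KB shared-mem/L1 array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block, halving the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// On Fermi the 64 KB array is configurable per kernel; a kernel that leans
// on shared memory can ask for the 48 KB shared / 16 KB L1 split:
//   cudaFuncSetCacheConfig(blockSum, cudaFuncCachePreferShared);
```

Launch with blockDim.x == TILE (a power of two) so the tree reduction divides evenly.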
[Diagram: CUDA core — dispatch port, operand collector, FP unit, INT unit, result queue]

Cached Memory Hierarchy
- First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- L1 cache per SM (32 cores): improves bandwidth and reduces latency
- Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
- Parallel DataCache memory hierarchy lets kernels operate on large data sets

Larger, Faster Memory Interface
- GDDR5 memory interface: 2x the speed of GDDR3
- Up to 1 Terabyte of memory attached to the GPU

Error Correcting Code
- ECC protection for DRAM; ECC supported for GDDR5 memory
- All major internal memories are ECC protected: register file, L1 cache, L2 cache

GigaThread Hardware Thread Scheduler
- Concurrent kernel execution + faster context switching
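Concurrent kernel execution is expressed through CUDA streams: kernels launched into different streams are declared independent, and on Fermi the GigaThread scheduler may overlap them instead of running them back to back. A minimal sketch (kernels and names are my own, not from the deck); on pre-Fermi parts the same code is valid but the kernels simply serialize.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

int main(void) {
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));
    cudaMemset(d_x, 0, 256 * sizeof(float));
    cudaMemset(d_y, 0, 256 * sizeof(float));

    // Separate streams tell the scheduler these launches are independent.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<1, 256, 0, s1>>>(d_x);   // may run concurrently with kernelB
    kernelB<<<1, 256, 0, s2>>>(d_y);
    cudaDeviceSynchronize();           // wait for both streams to drain

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Small kernels that would each leave most of the chip idle are the case concurrency helps most.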
[Diagram: serial kernel execution (kernels 1-5 back to back) vs parallel kernel execution (independent kernels overlapped in time)]

Enhanced Software Support
- Full C++ support: virtual functions, try/catch hardware support
- System call support: pipes, semaphores, printf, etc.
- Unified 64-bit memory addressing

Introducing Tesla Bio Workbench
- Applications: TeraChem, LAMMPS, GPU-AutoDock, MUMmerGPU
- Community: downloads, documentation, technical papers, discussion forums, benchmarks & configurations
- Platforms: Tesla Personal Supercomputer, Tesla GPU clusters

Tesla Bio Workbench Applications
- AMBER (MD), ACEMD (MD), GROMACS (MD), GROMOS (MD), LAMMPS (MD)
[Chart: AMBER throughput, Tesla C1060 GPU vs 2x Intel Quad Core E5462 2.8 GHz]
- Myoglobin (2,492 atoms): 4.04 ns/day on the GPU, a ~7x speedup
- Nucleosome (25,095 atoms): 0.52 ns/day on the GPU vs 0.06 ns/day on the CPU, a ~8.6x speedup
Data courtesy of San Diego Supercomputing Center
More info: http://www.nvidia.com/object/amber_on_tesla.html

ISV Status
Developer | Application | Description | Category | Status
ORNL | HOMME | High Order Method Modeling Environment | Government | Work in progress
Ames Lab | HOOMD | Molecular dynamics | Life Science | Available
NCI | AutoDock | Molecular dynamics | Life Science | Selected support from third parties; also an open source project — need to reach out to them more
ORNL | MADNESS | Computational chemistry | Life Science | Work in progress
— | Smith-Waterman | DNA sequencing | Life Science | Various versions available, such as http://gpu.epfl.ch/sw.html
Stanford | OpenMM | Molecular library | Life Science | Available, with PME in beta
Gaussian | Gaussian | Quantum chemistry | Life Science | Taking names, no date given

ISV Status (continued)
Developer | Application | Description | Category | Status
Harvard and Univ of Delaware | CHARMM | Molecular dynamics | Life Science | CUDA support in library as part of alpha build; working with them to get a beta in Feb
Howard Hughes Med | HMMER | Hidden Markov models for bio | Life Science | Available
IA State | GAMESS | Quantum chemistry | Life Science | Integration in progress, hopeful of beta in Q1
Scripps | LAMMPS | Molecular dynamics | Life Science | Selected algorithms supported, project ongoing, download available
Scripps | AMBER | Molecular dynamics | Life Science | AMBER 10 patch for GB today, PME by Dec/Jan
Scripps | AutoDock | Protein docking | Life Science | Available via 3rd party

ISV Status (continued)
Developer | Application | Description | Category | Status
Stockholm Center | GROMACS | Molecular dynamics | Life Science | Available with PME via OpenMM
UIUC | NAMD | Molecular dynamics | Life Science | NAMD 2.7 b2 available
UIUC | VMD | Visualization of MD | Life Science | Available
Univ of Delaware | [email protected] | Protein docking | Life Science | Status uncertain
Univ of Maryland | MUMmerGPU | DNA sequence alignment | Life Science | Available
Allinea | Allinea DDT | Linux debugger | Tools - Debug/Profile | Beta this month
TotalView | TotalView Debugger | Linux debugger | Tools - Debug/Profile | Beta in Q1

GPU Revolutionizing Computing
A 2015 GPU*
- ~20x the performance of today's GPU
- ~5,000 cores at ~3 GHz (50 mW each)
- ~20 TFLOPS
- ~1.2 TB/s of memory bandwidth

[Chart: GFlops over time — T8, 128 cores (2006); T10, 240 cores (2008); Fermi, 512 cores (2010); projected 2015 GPU]

* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.