ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2015

ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2015 Performance of a Computer (Chapter 4) Vishwani D. Agrawal James J. Danaher Professor Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849 http://www.eng.auburn.edu/~vagrawal [email protected] Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 1 What is Performance? Response time: the time between the start and completion of a task. Throughput: the total amount of work done in a given time. Some performance measures: MIPS (million instructions per second). MFLOPS (million floating point operations per second), also GFLOPS, TFLOPS (1012), etc. SPEC (System Performance Evaluation Corporation) benchmarks. LINPACK benchmarks, floating point computing, used for

supercomputers. Synthetic benchmarks. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 2 Small and Large Numbers 10-3 10-6 10-9 10-12 10-15 10-18 10-21 10-24 Spr 2015, Apr 13 . . . Small milli micro nano pico femto atto zepto yocto

m n p f 103 106 109 1012 1015 1018 1021 1024 ELEC 5200-001/6200-001 Lecture 10 Large kilo mega giga tera peta exa zetta yotta k M

G T P 3 Computer Memory Size Number bits bytes 210 1,024 K Kb KB 220 1,048,576 M Mb MB

230 1,073,741,824 G Gb GB 240 1,099,511,627,776 T Tb TB Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 4 Units for Measuring Performance Time in seconds (s), microseconds (s), nanoseconds (ns), or picoseconds (ps). Clock cycle Period of the hardware clock Example: one clock cycle means 1 nanosecond for a 1GHz clock frequency (or 1GHz clock rate)

CPU time = (CPU clock cycles)/(clock rate) Cycles per instruction (CPI): average number of clock cycles used to execute a computer instruction. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 5 Components of Performance Components of Performance Units CPU time for a program Time (seconds, etc.) Instruction count CPI Clock cycle time Spr 2015, Apr 13 . . . Instructions executed by the program Average number of clock cycles per

instruction Time period of clock (seconds, etc.) ELEC 5200-001/6200-001 Lecture 10 6 Time, While You Wait, or Pay For CPU time is the time taken by CPU to execute the program. It has two components: User CPU time is the time to execute the instructions of the program. System CPU time is the time used by the operating system to run the program. Elapsed time (wall clock time) is the time between the start and end of a program. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 7 12.9s 2:39 65%

Elapsed time In min:sec CPU time as percent of elapsed time User CPU time in seconds 90.7u System CPU time in seconds Example: Unix time Command 90.7 + 12.9 100 = 65% 159 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 8 Computing CPU Time CPU time

Spr 2015, Apr 13 . . . = Instruction count CPI Clock cycle time = Instruction count CPI Clock rate = Instructions Clock cycles 1 second Program Instruction Clock rate ELEC 5200-001/6200-001 Lecture 10 9 Comparing Computers C1 and C2 Run the same program on C1 and C2. Suppose both computers execute the same number ( N ) of instructions:

C1: CPI = 2.0, clock cycle time = 1 ns CPU time(C1) = N 2.0 1 = 2.0N ns C2: CPI = 1.2, clock cycle time = 2 ns CPU time(C2) = N 1.2 2 = 2.4N ns CPU time(C2)/CPU time(C1) = 2.4N/2.0N = 1.2, therefore, C1 is 1.2 times faster than C2. Result can vary with the choice of program. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 10 Comparing Program Codes I & II Code size for a program: Instr. Type CPI A 1

B 2 C 3 Code I has 5 million instructions Code II has 6 million instructions Code I is more efficient. Is it? Suppose a computer has three types of instructions: A, B and C. CPU cycles (code I) = 10 million CPU cycles (code II) = 9 million Code II is more efficient. CPI( I ) = 10/5 = 2 CPI( II ) = 9/6 = 1.5 Code II is more efficient. Caution: Code size is a misleading indicator of performance. Spr 2015, Apr 13 . . . Instruction count in million Code

Type Type Type Total A B C I 2 1 2 5 II 4 1 1 6 ELEC 5200-001/6200-001 Lecture 10 11

Rating of a Computer MIPS: million instructions per second Instruction count of a program MIPS = Execution time 106 MIPS rating of a computer is relative to a program. Standard programs for performance rating: Synthetic benchmarks SPEC benchmarks (System Performance Evaluation Corporation) Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 12 Synthetic Benchmark Programs Artificial programs that emulate a large set of typical real programs. Whetstone benchmark Algol and Fortran. Dhrystone benchmark Ada and C. Disadvantages: No clear agreement on what a typical instruction mix should be. Benchmarks do not produce meaningful result. Purpose of rating is defeated when compilers are written to optimize the performance rating.

Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 13 Ada Lady Augusta Ada Byron, Countess of Lovelace (1815-1852), daughter of Lord Byron (the poet who spent some time in a Swiss jail in Chillon, not too far from Lausanne...). She was the assistant and patron of Charles Babbage; she wrote programs for his Analytical Engine. An original print from its time. http://www.cs.kuleuven.ac.be/~dirk/ada-belgium/pictures.html Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 14 Misleading Compilers Consider a computer with a clock rate of 1 GHz.

Two compilers produce the following instruction mixes for a program: Instruction count (billions) Code from Type Type Type Total A B C Compiler 1 5 1 1 7 CPU clock cycles CPI 10109 1.43

CPU time* MIPS** (seconds) 10 700 Compiler 2 10 1 1 12 15109 1.25 15 Instruction types A: 1-cycle, B: 2-cycle, C: 3-cycle 800 * CPU time = CPU clock cycles/clock rate ** MIPS = (Total instruction count/CPU time) 10 6 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 15 Peak and Relative MIPS Ratings Peak MIPS Choose an instruction mix to minimize CPI

The rating can be too high and unrealistic for general programs Relative MIPS: Use a reference computer system Time(ref) Relative MIPS = Time MIPS(ref) Historically, VAX-11/ 780, believed to have a 1 MIPS performance, was used as reference. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 16 Wbopdia on MIPS Acronym for million instructions per second. An old measure of a computer's speed and power, MIPS measures roughly the number of machine instructions that a computer can execute in one second. In fact, some people jokingly claim that MIPS really stands for Meaningless Indicator of Performance. Despite these problems, a MIPS rating can give you a general idea of a computer's speed. The IBM PC/XT computer, for example, is rated at MIPS, while Pentium-based PCs run at over 100 MIPS.

Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 17 A 1994 MIPS Rating Chart MIPS 1975 IBM mainframe 1976 Cray-1 1979 DEC VAX 1981 IBM PC 1984 Sun 2 1994 Pentium PC 1995 Sony PCX video game 1995 Microunity set-top Spr 2015, Apr 13 . . . Price $/MIPS 10 $10M 1M

160 $20M 125K 1 $200K 200K 0.25 $3K 12K 1 $10K 10K 66 $3K

46 500 $500 1 1,000 $500 0.5 ELEC 5200-001/6200-001 Lecture 10 New York Times, April 20, 1994 Computer 18 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 19 MFLOPS (megaFLOPS)

MFLOPS = Number of floating-point operations in a program Execution time 106 Only floating point operations are counted: Float, real, double; add, subtract, multiply, divide MFLOPS rating is relevant in scientific computing. For example, programs like a compiler will measure almost 0 MFLOPS. Sometimes misleading due to different implementations. For example, a computer that does not have a floating-point divide, will register many FLOPS for a division. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 20 Supercomputer Performance Petaflops Teraflops Gigaflops Megaflops Spr 2015, Apr 13 . . .

ELEC 5200-001/6200-001 Lecture 10 http://en.wikipedia.org/wiki/Supercomputer Exaflops 21 Top Supercomputers, Nov 2014 www.top500.org Max. Pflops Power MW Pflops /MJoule Rank Name Location Cores Clock

GHz 1 Tianhe-2 Guangzhou China 3,120,000 2.2 33.86 17.80 1.90 2 Titan Cray USA 560,460 2.2

17.59 8.21 2.14 3 Sequoia IBM USA 1,572,864 1.6 17.17 7.89 2.18 4 K computer Fujitsu Japan

705,024 2.0 10.50 12.66 0.83 5 Mira IBM USA 786,432 1.6 8.59 3.95 2.18 N. Leavitt, Big Iron Moves Toward Exascale Computing, Computer, vol. 45, no. 11, pp. 14-17, Nov. 2012.

Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 22 The Future Erik P. DeBenedictis of Sandia National Laboratories theorizes that a zettaflops (1021) (one sextillion FLOPS) computer is required to accomplish full weather modeling, which could cover a two week time span accurately. Such systems might be built around 2030. http://en.wikipedia.org/wiki/Supercomputer Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 23 Performance Performance is measured for a given program or a set of programs: n Av. execution time = (1/n) Execution time (program i ) i =1

or n Av. execution time = [ Execution time (program i ) ]1/n i =1 Performance is inverse of execution time: Performance = 1/(Execution time) Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 24 Geometric vs. Arithmetic Mean Reference computer times of n programs: r1, . . . , rn Times of n programs on the computer under evaluation: T1, . . . , Tn Normalized times: T1/r1, . . . , Tn/rn Geometric mean = {(T1/r1) . . . (Tn/rn)}1/n {T1 . . . Tn}1/n = Used {r1 . . . rn}1/n Arithmetic mean = {(T1/r1)+ . . . +(Tn/rn)}/n {T1+ . . . +Tn}/n Not used

{r1+ . . . +rn}/n J. E. Smith, Characterizing Computer Performance with a Single Number, Comm. ACM, vol. 31, no. 10, pp. 1202-1206, Oct. 1988. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 25 SPEC Benchmarks System Performance Evaluation Corporation (SPEC) SPEC89 10 programs SPEC performance ratio relative to VAX-11/780 One program, matrix300, dropped because compilers could be engineered to improve its performance. www.spec.org Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 26 SPEC89 Performance Ratio for IBM Powerstation 550 800 700

600 500 400 compiler enhanced compiler 300 200 Spr 2015, Apr 13 . . . tomcatv fpppp matrix300 eqntott li nasa7 docuc spice espresso

0 gcc 100 ELEC 5200-001/6200-001 Lecture 10 27 SPEC95 Benchmarks Eight integer and ten floating point programs, SPECint95 and SPECfp95. Each program run time is normalized with respect to the run time of Sun SPARCstation 10/40 the ratio is called SPEC ratio. SPECint95 and SPECfp95 summary measurements are the geometric means of SPEC ratios. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 28 SPEC CPU2000 Benchmarks Twelve integer and 14 floating point programs, CINT2000 and CFP2000.

Each program run time is normalized to obtain a SPEC ratio with respect to the run time on Sun Ultra 5_10 with a 300MHz processor. CINT2000 and CFP2000 summary measurements are the geometric means of SPEC ratios. Retired in 2007, replaced with SPEC CPUTM 2006 https://www.spec.org/cpu2006/ Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 29 CINT2000 : Eleven Programs Name 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 256.bzip2 300.twolf Ref Time

1400 1400 1100 1800 1000 1800 1300 1800 1100 1900 1500 3000 Remarks Data compression utility (C) FPGA circuit placement and routing (C) C compiler (C) Minimum cost network flow solver (C) Chess program (C) Natural language processing (C) Ray tracing (C++) Perl (C) Computational group theory (C) Object Oriented Database (C) Data compression utility (C) Place and route simulator (C) https://www.spec.org/cpu2000/docs/readme1st.html#Q8 Spr 2015, Apr 13 . . .

ELEC 5200-001/6200-001 Lecture 10 30 CFP2000: Fourteen Programs (6 Fortran 77, 4 Fortran 90, 4 C) Name 168.wupwise 171.swim 172.mgrid 173.applu 177.mesa 178.galgel 179.art 183.equake 187.facerec 188.ammp 189.lucas 191.fma3d 200.sixtrack 301.apsi Ref Time Remarks 1600 Quantum chromodynamics 3100 Shallow water modeling 1800 Multi-grid solver in 3D potential field 2100 Parabolic/elliptic partial differential equations 1400 3D Graphics library

2900 Fluid dynamics: analysis of oscillatory instability 2600 Neural network simulation; adaptive resonance theory 1300 Finite element simulation; earthquake modeling 1900 Computer vision: recognizes faces 2200 Computational chemistry 2000 Number theory: primality testing 2100 Finite element crash simulation 1100 Particle accelerator model 2600 Solves problems regarding temperature, wind, velocity and distribution of pollutants https://www.spec.org/cpu2000/docs/readme1st.html#Q8 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 31 Reference CPU: Sun Ultra 5_10 300MHz Processor 3500 CPU seconds 3000 2500 2000 1500

CINT2000 CFP2000 1000 0 gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 twolf wupwise swim mgrid applu mesa galgel art equake facerec ammp

lucas fma3d sixtrack apsi 500 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 32 CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard) SPECint2000_base = 1341 SPECint2000 = 1389 2500 2000 1500 Base ratio Opt. ratio 1000 Spr 2015, Apr 13 . . .

twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc vpr 0 gzip 500 ELEC 5200-001/6200-001 Lecture 10

Source: www.spec.org 33 Two Benchmark Results Baseline: A uniform configuration not optimized for specific program: Same compiler with same settings and flags used for all benchmarks Other restrictions Peak: Run is optimized for obtaining the peak performance for each benchmark program. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 34 CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard) Spr 2015, Apr 13 . . . twolf bzip2

vortex gap perlbmk eon parser crafty mcf gcc vpr Base ratio Opt. ratio gzip 1000 900 800 700 600

500 400 300 200 100 0 SPECint2000_base = 579 SPECint2000 = 588 ELEC 5200-001/6200-001 Lecture 10 Source: www.spec.org 35 CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard) SPECfp2000_base = 648 SPECfp2000 = 659 1400 1200 1000 800 600 Base ratio Opt. ratio

400 0 wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi 200 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 Source: www.spec.org 36 Additional SPEC Benchmarks

SPECweb99: measures the performance of a computer in a networked environment. Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by: 1/(Execution time) Energy efficiency = Power in watts = Program units/joule Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 37 Energy Efficiency Efficiency averaged on n benchmark programs: n Efficiency = ( Efficiencyi )1/n i =1 where Efficiencyi is the efficiency for program i. Relative efficiency: Efficiency of a computer

Relative efficiency = Eff. of reference computer Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 38 SPEC2000 Relative Energy Efficiency 6 5 Pentium M @1.6/0.6GHz Energyefficient procesor Pentium 4-M @2.4GHz (Reference) 4 3 2 1 SPECFP2000 SPECINT2000 SPECFP2000 SPECINT2000

SPECFP2000 SPECINT2000 0 Pentium III-M @1.2GHz Always Laptop Min. power max. clock adaptive clk. min. clock Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 39 Energy and Time Perspectives Clock cycle is the unit of computing work. Cycle rate, f cycles per second f is the rate of doing computing work Hardware speed, similar to mph for a car Cycle efficiency, cycles per joule is the computing work per energy unit Hardware efficiency, similar to mpg for a car

Results from recent work: A. Shinde, Managing Performance and Efficiency of a Processor, MEE Project Report, Auburn Univ., Dec. 2012. A. Shinde and V. D. Agarwal, Managing Performance and Efficiency of a Processor, Proc. 45th IEEE Southeastern Symposium on System Theory, Baylor Univ., TX, March 2013, pp. 59-62. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 40 Energy/Cycle for an 8-bit Adder in 90nm CMOS Technology (PTM) K. Kim, Ultra Low Power CMOS Design PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 41 Delay of an 8-bit Adder in 90nm CMOS Technology (PTM) K. Kim, Ultra Low Power CMOS Design PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011. Spr 2015, Apr 13 . . .

ELEC 5200-001/6200-001 Lecture 10 42 Pentium M Processor Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, Thermal Response to DVFS: Analysis with an Intel Pentium M, Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224. VDD = 1.2V Maximum clock rate = 1.8GHz Critical path delay, td = 1/1.8GHz = 555.56ps Power consumption = 120W Energy per cycle, EPC = 120/(1.8GHz) = 66.67nJ Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 43 Cycle Efficiency and Frequency for Pentium M Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10

44 Example of Power Management For a program that executes in 1.8 billion clock cycles. Voltage VDD Frequency f MHz Cycle Efficiency, Execution Time second Total Energy Consumed Power f/ 1.2 V

1800 megacycles/s 15 megacycles/joule 1.0 120 Joules 120W 0.6 V 277 megacycles/s 70 megacycles/joule 6.5 25.7 Joules 3.96W 200 mV

54.5 megacycles/s 660 megacycles/joule 33 2.73 Joules 0.083W Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 45 Ways of Improving Performance Increase clock rate. Improve processor organization for lower CPI Pipelining Instruction-level parallelism (ILP): MIMD (Scalar) Data-parallelism: SIMD (Vector) multiprocessing Compiler enhancements that lower the instruction count or generate instructions with lower average CPI (e.g., by using simpler instructions). Spr 2015, Apr 13 . . .

ELEC 5200-001/6200-001 Lecture 10 46 Limits of Performance Execution time of a program on a computer is 100 s: 80 s for multiply operations 20 s for other operations Improve multiply n times: 80 Execution time = ( + 20 ) seconds n Limit: Even if n = , execution time cannot be reduced below 20 s. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 47 Amdahls Law The execution time of a system,

in general, has two fractions a fraction fenh that can be speeded up by factor n, and the remaining fraction 1 - fenh that cannot be improved. Thus, the possible speedup is: G. M. Amdahl, Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities, Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, April 1967, pp. 483-485. Spr 2015, Apr 13 . . . Old time Speedup = New time 1 = 1 fenh + fenh/n Gene Myron Amdahl born 1922 http://en.wikipedia.org/wiki/Gene_Amdahl ELEC 5200-001/6200-001 Lecture 10

48 Wisconsin Integrally Synchronized Computer (WISC), 1950-51 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 49 Parallel Processors: Shared Memory P P M P Spr 2015, Apr 13 . . . P P P ELEC 5200-001/6200-001 Lecture 10

50 Parallel Processors Shared Memory, Infinite Bandwidth N processors Single processor: non-memory execution time = Memory access time = 1 N processor run time, T(N)= 1 + /NN T(1) 1 N Speedup = = = T(N) 1 + /NN (1 )N + Maximum speedup = 1/N(1 ), when N = Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 51 Normalized run time, T(N) Run Time T(N) = 1 + /NN

/N 1 1 Spr 2015, Apr 13 . . . 2 3 4 5 6 Number of processors (N) ELEC 5200-001/6200-001 Lecture 10 7 52 Speedup 6 Speedup, T(1)/T(N) 5 Ideal, N ( = 1) 4

3 N (1 )N + 2 1 1 Spr 2015, Apr 13 . . . 2 3 4 5 6 Number of processors (N) ELEC 5200-001/6200-001 Lecture 10 53 Example 10% memory accesses, i.e., = 0.9 Maximum speedup= 1/N(1 a) = 1.0/N0.1 = 10, when N = What is the speedup with 10 processors? Spr 2015, Apr 13 . . .

ELEC 5200-001/6200-001 Lecture 10 54 Parallel Processors Shared Memory, Finite Bandwidth N processors Single processor: non-memory execution time = Memory access time = (1 )N N processor run time, T(N) = (1 )N + /NN Speedup Spr 2015, Apr 13 . . . = 1 N = (1 )N + /NN (1 )N2 + ELEC 5200-001/6200-001 Lecture 10 55 Normalized run time, T(N)

Run Time T(N) = (1 )N + /NN (1 )N /N 1 1 Spr 2015, Apr 13 . . . 2 3 4 5 6 Number of processors (N) ELEC 5200-001/6200-001 Lecture 10 7 56 Minimum Run Time Minimize N processor run time,

T(N) = (1 )N + /NN T(N)/N = 0 1 /NN2 = 0, N = [/N(1 )] Min. T(N) = 2[(1 )], because 2T(N)/N2 > 0. Maximum speedup = 1/T(N) = 0.5[(1 )]- Example: = 0.9 Maximum speedup = 1.67, when N = 3 Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 57 Speedup 6 Speedup, T(1)/T(N) 5 Ideal, N 4 3 N (1 )N2 + 2

1 1 Spr 2015, Apr 13 . . . 2 3 4 5 6 Number of processors (N) ELEC 5200-001/6200-001 Lecture 10 58 Parallel Processors: Distributed Memory M M P Interconnectio n network P M

Spr 2015, Apr 13 . . . P P M P P ELEC 5200-001/6200-001 Lecture 10 M M 59 Parallel Processors Distributed Memory N processors Single processor: non-memory execution time = Memory access time = 1 , same as single processor Communication overhead = (N 1) N processor run time, T(N) = (N 1) + 1/NN

Speedup Spr 2015, Apr 13 . . . = 1 N = (N 1) + 1/NN N(N 1) + 1 ELEC 5200-001/6200-001 Lecture 10 60 Minimum Run Time Minimize N processor run time, T(N) = (N 1) + 1/NN T(N)/N = 0 1/NN2 = 0, N = - Min. T(N) = 2 , because 2T(N)/N2 > 0. Maximum speedup = 1/T(N) = 1/(2 ) Example: = 0.01, Maximum speedup: N = 10 T(N) = 0.19 Speedup = 5.26 Spr 2015, Apr 13 . . .

ELEC 5200-001/6200-001 Lecture 10 61 Run Time Normalized run time, T(N) 1 T(N) = (N 1) + 1/NN (N 1) 1/N 1 0 10 20 30 Number of processors (N) Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 62

Speedup 12 Speedup, T(1)/T(N) 10 Ideal, N 8 N N(N 1) + 1 6 4 2 2 Spr 2015, Apr 13 . . . 4 6 8 10 Number of processors (N) ELEC 5200-001/6200-001 Lecture 10

12 63 Further Reading G. M. Amdahl, Validity of the Single Processor Approach to Achieving LargeScale Computing Capabilities, Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, Apr. 1967, pp. 483-485. J. L. Gustafson, Reevaluating Amdahls Law, Comm. ACM, vol. 31, no. 5, pp. 532-533, May 1988. M. D. Hill and M. R. Marty, Amdahls Law in the Multicore Era, Computer, vol. 41, no. 7, pp. 33-38, July 2008. D. H. Woo and H.-H. S. Lee, Extending Amdahls Law for Energy-Efficient Computing in the Many-Core Era, Computer, vol. 41, no. 12, pp. 24-31, Dec. 2008. S. M. Pieper, J. M. Paul and M. J. Schulte, A New Era of Performance Evaluation, Computer, vol. 40, no. 9, pp. 23-30, Sep. 2007. S. Gal-On and M. Levy, Measuring Multicore Performance, Computer, vol. 41, no. 11, pp. 99-102, November 2008. S. Williams, A. Waterman and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Comm. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009. U. Vishkin, Is Multicore Hardware for General-Purpose Parallel Processing Broken? Comm. ACM, vol. 57, no. 4, pp. 35-39, Apr. 2014. Spr 2015, Apr 13 . . . ELEC 5200-001/6200-001 Lecture 10 64

Recently Viewed Presentations

  • Implementation of Basic Digital Filter Structures R.C. Maher

    Implementation of Basic Digital Filter Structures R.C. Maher

    Times New Roman Courier New Wingdings Symbol Default Design Implementation of Basic Digital Filter Structures Digital Filters FIR Filter Review FIR Setup FIR Code for 56300 IIR Filters IIR Issues: Stability and Sensitivity Overflow Issues Second-Order Sections Implementing 2nd Order...
  • Outline

    Outline

    8/1/19. Conclusion. Data re-transmission feature has been implement for the ROC. Up to 8 packet errors can be recovered. Con: Cannot re-transmit more than 8 packets
  • Calvin vs Wesley: Bringing Belief in Line with Practice

    Calvin vs Wesley: Bringing Belief in Line with Practice

    Prevenient Grace . porch / courtship . Justifying Grace . door / wedding. Sanctifying Grace . living room / marriage. Calvin's View of Salvation. From start to finish God's work. Effectual or irresistible. Salvation as union with Christ & faith...
  • Energy: Forms and Changes Questions to think about

    Energy: Forms and Changes Questions to think about

    Kinetic vs. Potential Practice . Classify the following as a type of potential energy or kinetic energy (use the letters K or P) 1. A bicyclist pedaling up a hill _____ 2. An archer with his bow drawn _____ 3....
  • Access to Finance for Small & Medium Enterprises

    Access to Finance for Small & Medium Enterprises

    Key stylized facts and empirical approach . Macroeconomic benefits and drivers of SME financial inclusion. The role of Fintech. Conclusion. 6/17/2019. Promoting new financing channels: the role of Fintech. Enhancing traditional SME finance . Credit information .
  • THE FOURTH SUNDAY OF LENT MARCH 30, 2014

    THE FOURTH SUNDAY OF LENT MARCH 30, 2014

    10:30 AM - Jean Doran by Ellen Doran and Family ***** REQUIESCAT IN PACE:May Jesus, who suffered and died for our sins, welcome to His eternal Realm MARY E. KLEIN, a former member of the former St. Mary Star of...
  • Chapter 12 Issues and Challenges in Transport Geography

    Chapter 12 Issues and Challenges in Transport Geography

    Lower energy consumption pet unit carried. Assets. Higher level of utilization of modes and infrastructure. Safety. Reduced number of accidents, fatalities and injuries. Accessibility. Improved accessibility; reduced friction of distance. Cross-border. Improved throughput and security. Infrastructure. Longer life cycle, improved...
  • Learning from Observations

    Learning from Observations

    Reducing variance. Gradient estimate (for a single trajectory): ∇??(?)≈?≥0?(?)∇? log??(??,??) Observation: it seems bad to weight each action in a trajectory by the return of the entire trajectory.