
EECS 262a Advanced Topics in Computer Systems
Lecture 17: Comparison of Parallel DB, CS, MR and Jockey
October 24th, 2012
John Kubiatowicz and Anthony D. Joseph
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs262

Today's Papers
  "A Comparison of Approaches to Large-Scale Data Analysis." Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker. Appears in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009.
  "Jockey: Guaranteed Job Latency in Data Parallel Clusters." Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. Appears in Proceedings of the European Conference on Computer Systems (EuroSys), 2012.

Thoughts?

Two Approaches to Large-Scale Data Analysis
  Both run on "shared nothing" clusters.
  MapReduce: distributed file system; Map, Split, Copy, Reduce; MR scheduler.
  Parallel DBMS: standard relational tables (physical location transparent); data are partitioned over cluster nodes; SQL.
    Join processing (T1 joins T2): if T2 is small, copy T2 to all the machines; if T2 is large, hash-partition T1 and T2 and send the partitions to different machines (similar to the split/copy step in MapReduce). A sketch of this decision appears below.
    Query optimization.
    Intermediate tables not materialized by default.
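As a rough illustration of the join-processing choice above, here is a minimal Java sketch of how a shared-nothing engine might pick between broadcasting a small T2 and hash-partitioning both tables. All names and the size threshold are hypothetical, not from the paper.

```java
// Minimal sketch of the broadcast-vs-repartition join decision described above.
public class JoinPlanSketch {
    enum Strategy { BROADCAST_T2, HASH_REPARTITION }

    // Hash partitioning: rows of T1 and T2 with equal join keys land on the same
    // node, analogous to MapReduce's split/copy of map output to reducers.
    static int partitionFor(String joinKey, int numNodes) {
        return Math.floorMod(joinKey.hashCode(), numNodes);
    }

    static Strategy chooseStrategy(long t2Bytes, long broadcastThresholdBytes) {
        // Small T2: ship a full copy to every node and join locally.
        // Large T2: hash-partition both inputs on the join key instead.
        return (t2Bytes <= broadcastThresholdBytes)
                ? Strategy.BROADCAST_T2
                : Strategy.HASH_REPARTITION;
    }

    public static void main(String[] args) {
        System.out.println(chooseStrategy(50L << 20, 256L << 20));  // BROADCAST_T2
        System.out.println(chooseStrategy(20L << 30, 256L << 20));  // HASH_REPARTITION
        System.out.println(partitionFor("http://example.com/p1", 100));
    }
}
```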

Architectural Differences

                      Parallel DBMS               MapReduce
  Schema Support      O                           X
  Indexing            O                           X
  Programming Model   Stating what you want       Presenting an algorithm
                      (SQL)                       (C/C++, Java, ...)
  Optimization        O                           X
  Flexibility         Spotty UDF support          Good
  Fault Tolerance     Not as good                 Good

Schema Support
  Parallel DBMS: relational schema required. Good if data are shared by multiple applications.
  MapReduce: flexible; programmers write code to interpret input data. Good for a single-application scenario; bad if data are shared by multiple applications, which must then address data syntax, consistency, etc. themselves.

Programming Model & Flexibility
  Parallel DBMS: SQL, plus user-defined functions, stored procedures, and user-defined aggregates.
  MapReduce: low level, but very flexible. "Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets."

Indexing
  Parallel DBMS: hash/B-tree indexes well supported.
  MapReduce: no native index support. Programmers can implement their own index support in Map/Reduce code, but it is hard to share such customized indexes across multiple applications.

Execution Strategy & Fault Tolerance
  Parallel DBMS: intermediate results are pushed across the network. If a node fails, the entire query must be re-run.
  MapReduce: intermediate results are saved to local files. If a node fails, only that node's task is re-run on another node. At a mapper machine, when multiple reducers are reading multiple local files, there can be a large number of disk seeks, leading to poor performance.

Avoiding Data Transfers
  Parallel DBMS: a lot of optimizations, such as determining where to perform filtering.
  MapReduce: schedules Map close to the data, but other than this, programmers must avoid data transfers themselves.

Performance Benchmarks
  Benchmark environment
  Original MR task (Grep)
  Analytical tasks: Selection, Aggregation, Join, User-defined-function (UDF) aggregation

Node Configuration
  100-node cluster. Each node: 2.40GHz Intel Core 2 Duo, 64-bit Red Hat Enterprise Linux 5 (kernel 2.6.18), 4GB RAM, two 250GB SATA HDDs.
  Nodes are connected to Cisco Catalyst 3750E 1Gbps switches; the internal switching fabric is 128Gbps, with 50 nodes per switch.

  Multiple switches are connected in a 64Gbps Cisco StackWise Plus ring; the ring is used only for cross-switch communication.

Tested Systems
  Hadoop (0.19.0 on Java 1.6.0): HDFS data block size 256MB; JVMs use a 3.5GB heap per node; rack awareness enabled for data locality; three replicas without compression (compression or fewer replicas in HDFS did not improve performance).
  DBMS-X (a parallel SQL DBMS from a major vendor): row store; 4GB shared memory for buffer pool and temp space per node; compressed tables (compression often reduces time by ~50%).
  Vertica: column store; 256MB buffer size per node; compressed columns by default.
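For concreteness, a minimal sketch of the Hadoop settings listed above using Hadoop's Configuration API. The property names are the 0.19-era ones; treat the exact names and values as assumptions, since the benchmark set them through the cluster's configuration files rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch, assuming Hadoop 0.19-era property names.
public class BenchmarkHadoopConf {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);   // 256MB HDFS blocks
        conf.setInt("dfs.replication", 3);                    // three replicas, no compression
        conf.set("mapred.child.java.opts", "-Xmx3500m");      // ~3.5GB heap for task JVMs
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(build().get("dfs.block.size"));
    }
}
```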

Benchmark Execution
  Data loading time: the actual loading of the data, plus additional operations after the loading, such as compressing or building indexes.
  Execution time: for DBMS-X and Vertica, final results are piped from a shell command into a file; for Hadoop, final results are stored in HDFS, with an additional step to combine the multiple output files into a single file.

Performance Benchmarks: Original MR task (Grep)

Task Description (Grep, from the MapReduce paper)
  Input data set: 100-byte records. Look for a three-character pattern; one match per 10,000 records.
  Two scalings: vary the number of nodes with a fixed data size per node (535MB/node), and vary the number of nodes with a fixed total data size (1TB).

Data Loading
  Hadoop: copy text files into HDFS in parallel.
  DBMS-X: LOAD SQL command executed in parallel; it performs hash partitioning and distributes records to multiple machines. Data is then reorganized on each node (compression, index building, other housekeeping), also in parallel.
  Vertica: COPY command loads data in parallel; data is redistributed, then compressed.

Data Loading Times
  [Figure: DBMS-X loading times; grey is loading, white is re-organization after loading.]
  DBMS-X loading is actually sequential despite the parallel load commands. Hadoop does better because it only copies the data to three HDFS replicas.

Execution (Grep)
  SQL: SELECT * FROM Data WHERE field LIKE '%XYZ%'; — a full table scan.
  MapReduce: the Map does the pattern search; there is no Reduce. An additional MR job combines the output into a single file.
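A minimal Java sketch of what such a map-only grep pass looks like, using the newer org.apache.hadoop.mapreduce API (the benchmark ran Hadoop 0.19, whose mapred API differs); illustrative only, not the paper's actual program.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only grep: emit every record that contains the search pattern.
// The hard-coded pattern "XYZ" mirrors the SQL predicate above.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static final String PATTERN = "XYZ";

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        if (record.toString().contains(PATTERN)) {
            context.write(record, NullWritable.get());   // no Reduce phase
        }
    }
}
```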

Grep results
  [Figure 4: execution time, with Hadoop's time split into the grep job and the combine-output job.]
  Hadoop's large start-up cost shows up in Figure 4 when the data per node is small; Vertica benefits from its good data compression.

Performance Benchmarks: Analytical Tasks (Selection, Aggregation, Join, UDF aggregation)

Input Data
  Input #1: random HTML documents; within each document, links are generated with a Zipfian distribution; 600,000 unique HTML documents with unique URLs per node.
  Input #2: 155 million UserVisits records (20GB/node).
  Input #3: 18 million Rankings records (1GB/node).

Selection Task
  Find the pageURLs in the Rankings table (1GB/node) with a pageRank above a threshold; about 36,000 records qualify per data file (very selective).
  SQL: SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
  MR: a single Map, no Reduce (a sketch follows below).

Selection results
  Hadoop pays its start-up cost; the DBMSs use an index on pageRank; Vertica's reliable message layer becomes the bottleneck.
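Returning to the MR side of the selection task above, a minimal sketch with the Hadoop MapReduce API. The pipe-delimited record layout, field positions, and configuration key are assumptions for illustration, not the benchmark's actual format.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only selection: emit (pageURL, pageRank) when pageRank exceeds a threshold
// passed in through the job configuration. Delimiter and field order are assumed.
public class SelectionMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int threshold;

    @Override
    protected void setup(Context context) {
        threshold = context.getConfiguration().getInt("selection.pagerank.threshold", 10);
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split("\\|");   // pageURL | pageRank | ...
        int pageRank = Integer.parseInt(fields[1]);
        if (pageRank > threshold) {
            context.write(new Text(fields[0]), new Text(Integer.toString(pageRank)));
        }
    }
}
```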

Aggregation Task
  Calculate the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column. Nodes must exchange information to compute the group-by; the result is about 53MB of data regardless of the number of nodes.
  SQL: SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
  MR: Map outputs (sourceIP, adRevenue); Reduce computes the sum per sourceIP; a Combiner is used.
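A minimal sketch of that aggregation in the Hadoop MapReduce API; the same reducer class can serve as the Combiner, and the UserVisits delimiter and field positions are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AggregationJob {
    // Map: emit (sourceIP, adRevenue); assumed layout: sourceIP | destURL | visitDate | adRevenue | ...
    public static class AggMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] f = record.toString().split("\\|");
            context.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[3])));
        }
    }

    // Reduce (also usable as the Combiner): sum adRevenue per sourceIP.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text sourceIP, Iterable<DoubleWritable> revenues, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable r : revenues) total += r.get();
            context.write(sourceIP, new DoubleWritable(total));
        }
    }
}
```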

Aggregation results
  DBMS: local group-by, then the coordinator performs the global group-by; performance is dominated by data transfer.

Join Task
  Find the sourceIP that generated the most revenue within January 15-22, 2000, then calculate the average pageRank of all the pages visited by that sourceIP during this interval.
  SQL:
    SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
    GROUP BY UV.sourceIP;
    SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
  MR:
    Phase 1: filter out UserVisits records outside the desired date range and join the qualifying records with records from the Rankings file (a reduce-side join; see the sketch below).
    Phase 2: compute total adRevenue and average pageRank per sourceIP.
    Phase 3: produce the record with the largest totalRevenue.

Join results
  The DBMSs use an index, and both relations are partitioned on the join key; MR has to read all the data! (Though the DBMSs originally had to scan everything during loading.)
  MR phase 1 takes 1434.7 seconds on average: roughly 600 seconds of raw I/O to read the table and 300 seconds to split, parse, and deserialize, so CPU overhead is the limiting factor.
  Load times for the DBMSs are in the 10K-second range!
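For reference, a minimal sketch of the phase-1 reduce-side join mentioned above (not the authors' code). Assume one mapper tags Rankings records with "R|" keyed by pageURL and another tags date-filtered UserVisits records with "V|" keyed by destURL; this reducer then pairs them per URL.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: values for a given URL arrive tagged by their source table.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> taggedRecords, Context context)
            throws IOException, InterruptedException {
        List<String> rankings = new ArrayList<String>();
        List<String> visits = new ArrayList<String>();
        for (Text t : taggedRecords) {
            String s = t.toString();
            if (s.startsWith("R|")) rankings.add(s.substring(2));
            else                    visits.add(s.substring(2));
        }
        // Emit one joined record per (Rankings, UserVisits) pair sharing this URL.
        for (String r : rankings) {
            for (String v : visits) {
                context.write(url, new Text(r + "|" + v));
            }
        }
    }
}
```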

UDF Aggregation Task
  Compute the inlink count per document.
  SQL: SELECT INTO Temp F(contents) FROM Documents; SELECT url, SUM(value) FROM Temp GROUP BY url;
  A user-defined function F is needed to parse an HTML document. Neither DBMS supports UDFs very well; for example, they work around it by opening the HTML document from local disk rather than performing all processing inside the DBMS.
  MR: a standard MR program.

UDF Aggregation results
  [Figure: for the DBMSs, the lower bar segment is UDF time and the upper is the rest of the query; for Hadoop, the lower segment is the query and the upper is combining all results into one file.]
  UDF support is not great in either DBMS.

Discussion
  Is this a fair comparison?
    Load time is not fairly compared here (especially for the big Join task!).
    Many Hadoop tasks are followed by other tasks, so they don't need to spend time collecting everything into one file.
  Are the conclusions about MapReduce fair?

    Nothing about the MapReduce framework requires continuous reparsing of input information.
    Nothing about the MapReduce framework prevents the use of indices.
    Hadoop's task start-up cost is big but not inherent: on a 100-node system, it is about 10 seconds until the first task starts and 25 seconds until all nodes are running tasks.
  What happens when the MapReduce framework incorporates the lessons of DBMSs? Then the real difference becomes one of expression: declarative (SQL) vs. imperative (MR).

Alternative: HadoopDB (an architectural hybrid of MR and DBMS)
  The basic idea: use MR as the communication layer above multiple nodes running single-node DBMS instances.
  Queries are expressed in SQL and translated into MR by extending existing tools; as much work as possible is pushed into the higher-performing single-node databases.
  How many of the complaints from the Comparison paper still apply here?

Is this a good paper?
  What were the authors' goals? What about the evaluation/metrics? Did they convince you that this was a good system/approach? Were there any red flags? What mistakes did they make? Does the system/approach meet the Test of Time challenge? How would you review this paper today?

Domain for Jockey: large cluster jobs
  Predictability is very important; enforcement of deadlines is one way toward predictability.

Variable Execution Latency: Prevalent
  Even for the job with the narrowest latency profile, there is over 4.3x variation in latency.
  Reasons for latency variation: pipeline complexity, noisy execution environment, excess resources.

Model of an individual job: a graph of interconnected stages
  [Figure: a job is composed of stages, each made up of tasks.]

Dryad's DAG Workflow
  Many simultaneous job pipelines execute at once, some on behalf of Microsoft, others on behalf of customers.

Compound Workflow
  [Figure: a pipeline of dependent jobs, each with its own deadline.]
  Dependencies mean that deadlines on the complete pipeline create deadlines on the constituent jobs; the median job's output is used by 10 additional jobs.

Best way to express performance targets?
  Priorities? Not expressive enough. Weights? Difficult for users to set. Utility curves? Capture both deadline and penalty.

Jockey's goal: maximize utility while minimizing resources by dynamically adjusting the allocation.

Application Modeling
  C(progress, allocation) → remaining run time.
  Techniques:
    Job simulator: input from profiling feeds a simulator that explores possible scenarios.
    Amdahl's Law: Time = S + P/N, where S is the serial work, P the parallelizable work, and N the allocation; estimate S and P from the standpoint of the current stage.
  Progress metric? Many were explored; the one used is totalworkWithQ: the total time completed tasks spent enqueued or executing.
  Optimization: the minimum allocation that maximizes utility.

Example: C(progress, allocation) as a table of predicted remaining minutes; the Jockey control loop (sketched after this table) re-consults it as the job progresses and picks the cheapest allocation that still meets the deadline.

                 10 nodes   20 nodes   30 nodes
  1% complete    60 min     40 min     25 min
  2% complete    59 min     39 min     24 min
  3% complete    58 min     37 min     22 min
  4% complete    56 min     36 min     21 min
  5% complete    54 min     34 min     20 min

  Ex: 1% complete, 50-minute deadline → 20 nodes suffice (predicted 40 min).
  Ex: 3% complete, 50-minute deadline → 20 nodes still suffice (predicted 37 min).
  Ex: 5% complete, 30-minute deadline → 30 nodes are needed (predicted 20 min).
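A minimal Java sketch of that control-loop idea under the assumptions above: a C(progress, allocation) table of predicted remaining minutes, a deadline, and selection of the smallest allocation whose prediction still meets the deadline. Illustrative only; Jockey's real implementation uses the simulator, utility curves, and the mechanisms described in the paper.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy version of the Jockey-style allocation decision, using the example table above.
public class JockeyControlLoopSketch {
    // C(progress, allocation): predicted remaining minutes, keyed by percent complete.
    // Columns correspond to allocations of 10, 20, and 30 nodes.
    static final int[] ALLOCATIONS = {10, 20, 30};
    static final NavigableMap<Integer, int[]> REMAINING = new TreeMap<>();
    static {
        REMAINING.put(1, new int[]{60, 40, 25});
        REMAINING.put(2, new int[]{59, 39, 24});
        REMAINING.put(3, new int[]{58, 37, 22});
        REMAINING.put(4, new int[]{56, 36, 21});
        REMAINING.put(5, new int[]{54, 34, 20});
    }

    // Pick the smallest allocation whose predicted completion still meets the deadline.
    static int allocationFor(int percentComplete, int minutesUntilDeadline) {
        int[] row = REMAINING.floorEntry(percentComplete).getValue();
        for (int i = 0; i < ALLOCATIONS.length; i++) {
            if (row[i] <= minutesUntilDeadline) return ALLOCATIONS[i];
        }
        return ALLOCATIONS[ALLOCATIONS.length - 1];   // deadline unreachable: give the maximum
    }

    public static void main(String[] args) {
        System.out.println(allocationFor(1, 50));   // 20 nodes
        System.out.println(allocationFor(3, 50));   // 20 nodes
        System.out.println(allocationFor(5, 30));   // 30 nodes
    }
}
```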

Jockey in Action
  [Figure sequence: a running job's allocation over time, compared to the oracle allocation (total allocation-hours) and the deadline.]
  Initial deadline: 140 minutes.
  New deadline: 70 minutes.
  Resources are released due to excess pessimism.
  Available parallelism drops below the allocation.
  Allocation rises above the oracle.

Evaluation
  [Figure: jobs which met the SLO; roughly a 1.4x improvement with Jockey.]
  Missed 1 of 94 deadlines.
  Simulator made good predictions: 80% finish before the deadline.
  Allocated too many resources.
  Control loop is stable and successful.

Is this a good paper?
  What were the authors' goals? What about the evaluation/metrics? Did they convince you that this was a good system/approach? Were there any red flags? What mistakes did they make? Does the system/approach meet the Test of Time challenge? How would you review this paper today?
