U of I CS 525 - A comparison of approaches to large-scale data analysis - D2778997

School name University of Illinois

Course CS 525 - Advanced Distributed Systems

Pages 8

This preview shows page 1-2-3 out of 8 pages.

CS 525 Advanced Distributed Systems, Spring 2010
Indranil Gupta (Indy)
Old Wine: Stale or Vintage?
April 27, 2010
All Slides © IG

A comparison of approaches to large-scale data analysis
A. Pavlo et al., SIGMOD 2009
(there is also a shorter paper in CACM, Jan 2010)

Databases vs. Clouds
- Basic question: why use MapReduce (MR) when parallel databases have been around for decades and have been successful?
- Written by experts in database research, including those who defined relational DBs many decades ago
- Some in the database community felt they might be hurt by these new-fangled cloud computing paradigms, which were just reinventing the wheel

Relational Database
- Consists of schemas: Employee schema, Company schema
- Schema = table consisting of tuples
  - Employee (name, company)
  - Company (name, number of employees)
- Tuple consists of multiple fields, including primary key, foreign key, etc.
  - Employee (name, company)

Parallel DB is similar to MR
- Parallel DBs parallelize query operation across multiple servers
- Parallel DBs: data processing consists of 3 phases
- E.g., consider joining two tables T1 and T2
  1. Filter T1 and T2 in parallel (~ Map)
  2. Distribute the larger of T1 and T2 across multiple nodes, then broadcast the other Ti to all nodes (~ Shuffle)
  3. Perform join at each node, and store back answers (~ Reduce)
- Parallel DBs used in the paper: DBMS-X and Vertica
- MR representative: Hadoop

Advantages of Parallel DBs over MR
- Schema support: more structured storage of data
  - Really needed?
- Indexing, e.g., B-trees
  - Really needed? What about BigTable?
- Programming model: more expressive
  - SQL queries run on multiple schemas very efficiently
  - What about Hive and Pig Latin?
- Execution strategy: push data instead of pull (as in MR)
  - They argue it reduces bottlenecks
  - Really? What about the network I/O bottleneck?

Load Times
- In general, best for Hadoop
- The cost of structured schema and indexing
- Data loaded on nodes sequentially: implementation artifact
- Are these operations really needed?

Actual Execution
- Hadoop always seems worst
  - Overhead rises linearly with data stored per node
- Vertica is best
  - Aggressive compression
  - More effective as more data is stored per node

Other Tasks
- MR was never built for Join
  - What about MapReduce-Merge?
- UDF does worst because of row-by-row interaction of the DB and the input file, which is outside the DB
  - How common is this in parallel DBs?

What have we learnt?
- Load time advantage is in 1000s of seconds for MR, while execution time disadvantage is in 10s of seconds
  - Means what?
- Pre-processing cuts down on processing time
  - E.g., selection task: Vertica and DBMS-X already use an index on the pageRank column
  - No reason why we can't do this preprocessing using MR
- MapReduce is better matched to on-demand computations, while parallel DBs are better for repeated computations

Discussion Points
- Performance vs. complexity: does the study account for this tradeoff?
  - Which is more complex: RDBs or MR?
- "It is not clear how many MR users really need 1000 nodes" (page 2)
  - What do you think?
  - What about shared Hadoop services?

On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing
I. Foster et al., IPTPS 2003

Context
- Written by the "father of Grid Computing", Ian Foster
- Written in an era where
  - P2P was a very active research area
  - Grid computing was very active
  - P2P and Grid communities were separate: different researchers and conferences, with not much interaction between them
  - Cloud computing had not yet emerged

Distributed Computing Resources
[Figure: sites Wisconsin, NCSA, MIT]

An Application Coded by a Physicist
[Figure: workflow of Jobs 0 to 3. Output files of Job 0 are input to Job 2; output files of Job 2 are input to Job 3; Jobs 1 and 2 can be concurrent.]
[Figure: Jobs 0 to 3 placed across Wisconsin, NCSA, and MIT; labels: Condor Protocol, Globus Protocol]

The Grid Recently
- Some are 40 Gbps links (the TeraGrid links)
- "A parallel Internet"

Some Things Grid Researchers Consider Important
- Single sign-on: collective job set should require once-only user authentication
- Mapping to local security mechanisms: some sites use Kerberos, others use Unix
- Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to job 1
- Community authorization: e.g., third-party authentication
- For clouds, you need to additionally worry about failures, scale, on-demand nature, and so on

P2P and Grid
- "We must address scale & failure"
- "We need infrastructure"

Definitions
- Grid
  - "Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities" (1998)
  - "A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS" (2002)
- P2P
  - "Applications that take advantage of resources at the edges of the Internet" (2000)
  - "Decentralized, self-organizing distributed systems, in which all or most communication is symmetric" (2002)

Grid versus P2P: Pick your favorite
(Grid and P2P definitions as on the Definitions slide)
- Grid (525): good legal applications without intellectual fodder
- P2P (525): clever designs without good legal applications

Applications
- Grid
  - Often complex and involving various combinations of data manipulation, computation, and tele-instrumentation
  - Wide range of computational models, e.g., embarrassingly parallel, tightly coupled, workflow
  - Consequence: complexity often inherent in the application itself
- P2P
  - Some: file sharing, number crunching, content distribution, measurements
  - Legal applications
  - Consequence: low complexity

Scale and Failure
- Grid
  - Large amounts of activity: 4 5 TB/day (D0 experiment)
  - Approaches to failure reflect assumptions, e.g., centralized components
- P2P
  - Very large numbers of entities: FastTrackC 4,277,745; iMesh 1,398,532
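
Illustrative sketch (not from the slides or the paper): the three-phase parallel-DB join of T1 and T2 described earlier (filter, then distribute the larger table and broadcast the other, then join at each node) can be simulated in a single Python process to show how each phase lines up with Map, Shuffle, and Reduce. The function name parallel_join, its parameters, the round-robin partitioning, the per-node hash join, and the sample rows are all assumptions made for illustration; the table shapes reuse the Employee (name, company) and Company (name, number of employees) schemas from the slides.

```python
from collections import defaultdict


def parallel_join(t1, t2, key1, key2, pred1, pred2, num_nodes=4):
    """Simulate the 3-phase parallel join on one machine (hypothetical helper)."""
    # Phase 1 (~ Map): filter both tables. A real system would run this
    # on every node in parallel; here it is a plain list comprehension.
    f1 = [row for row in t1 if pred1(row)]
    f2 = [row for row in t2 if pred2(row)]

    # Phase 2 (~ Shuffle): partition the larger filtered table across the
    # nodes (round-robin is an assumption) and broadcast the smaller one.
    swapped = len(f1) < len(f2)
    large, small = (f2, f1) if swapped else (f1, f2)
    large_key, small_key = (key2, key1) if swapped else (key1, key2)
    partitions = [large[i::num_nodes] for i in range(num_nodes)]

    # Phase 3 (~ Reduce): each simulated node builds a hash index over its
    # copy of the broadcast table and joins it with its local partition.
    results = []
    for part in partitions:
        index = defaultdict(list)
        for s in small:
            index[s[small_key]].append(s)
        for big_row in part:
            for s in index.get(big_row[large_key], []):
                # Always emit pairs as (row from t1, row from t2).
                results.append((s, big_row) if swapped else (big_row, s))
    return results


# Hypothetical sample data following the Employee and Company schemas.
employees = [
    {"name": "alice", "company": "Acme"},
    {"name": "bob", "company": "Initech"},
    {"name": "carol", "company": "Acme"},
]
companies = [
    {"name": "Acme", "employees": 2},
    {"name": "Initech", "employees": 1},
    {"name": "Globex", "employees": 0},
]

# Join employees to companies that have at least one employee.
rows = parallel_join(
    employees, companies,
    key1="company", key2="name",
    pred1=lambda r: True,
    pred2=lambda r: r["employees"] > 0,
    num_nodes=2,
)
pairs = sorted((e["name"], c["name"]) for e, c in rows)
```

Broadcasting the smaller table means only the larger table has to be split, which is why phase 2 picks the tables by size after filtering rather than before.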
