Introduction to Database Systems
CSE 444, Winter 2011
Lecture 24: Parallel Databases
http://www.cs.washington.edu/education/courses/cse444/11wi/
Version March 4, 2011

Overview
- Parallel architectures and operators: Ch. 20.1
- Map-reduce: Ch. 20.2
- Semijoin reductions, full reducers: Ch. 20.4 (we covered this a few lectures ago)

Parallel vs. Distributed Databases
- Parallel database system: improves performance through a parallel implementation
- Distributed database system: data is stored across several sites, each site managed by a DBMS capable of running independently

Parallel DBMSs
- Goal: improve performance by executing multiple operations in parallel
- Key benefit: cheaper to scale than relying on a single, increasingly more powerful processor
- Key challenge: ensure that overhead and contention do not kill performance

Performance Metrics for Parallel DBMSs
- Speedup
  - More processors → higher speed
  - Individual queries should run faster
  - Should do more transactions per second (TPS)
  - Fixed overall problem size, varying # of processors ("strong scaling")
- Scaleup
  - More processors → can process more data
  - Fixed problem size per processor, varying # of processors ("weak scaling")
  - Batch scaleup: the same query on larger input data should take the same time
  - Transaction scaleup: N times as many TPS on an N-times larger database, but each transaction typically remains small

Linear vs. Non-linear Speedup
[Figure: speedup as a function of the number of processors P; linear speedup is the ideal]
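The two metrics can be made concrete with a small numeric sketch. The timings below are made up for illustration and are not from the course:

```python
# Illustrating speedup and batch scaleup with hypothetical timings.

def speedup(t_one_proc, t_p_procs):
    """Speedup = time on 1 processor / time on P processors (same problem size)."""
    return t_one_proc / t_p_procs

def batch_scaleup(t_small, t_large):
    """Batch scaleup is ideal (1.0) when P-times larger data on P processors
    takes the same time as the original data on one processor."""
    return t_small / t_large

# Hypothetical: a query takes 100 s on 1 processor and 26 s on 4 processors.
print(speedup(100, 26))         # ~3.85, slightly below the linear ideal of 4
# Hypothetical: 4x the data on 4 processors takes 110 s vs. 100 s originally.
print(batch_scaleup(100, 110))  # ~0.91, slightly below the ideal of 1.0
```

Startup cost, interference, and skew (discussed next) are what push real systems below these linear ideals.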
Linear vs. Non-linear Scaleup
[Figure: batch scaleup as a function of # of processors P and data size (×1, ×5, ×10, ×15); linear scaleup is the ideal]

Challenges to Linear Speedup and Scaleup
- Startup cost: cost of starting an operation on many processors
- Interference: contention for resources between processors
- Skew: the slowest processor becomes the bottleneck

Architectures for Parallel Databases
- Shared memory
- Shared disk
- Shared nothing

Shared Memory
[Figure: processors P connected by an interconnection network to a global shared memory and disks D]

Shared Disk
[Figure: processors P, each with a private memory M, connected by an interconnection network to shared disks D]

Shared Nothing
[Figure: processors P, each with a private memory M and private disks D, connected by an interconnection network]

Shared Nothing
- Most scalable architecture: minimizes interference by minimizing resource sharing
- Can use commodity hardware
- Also the most difficult to program and manage
- Processor = server = node; "processor" != core; P = number of nodes
- We will focus on shared nothing

Taxonomy for Parallel Query Evaluation
Important question: what exactly can we actually parallelize in a parallel database?
- Inter-query parallelism: each query runs on one processor
- Inter-operator parallelism: a query runs on multiple processors; an operator runs on one processor
- Intra-operator parallelism: an operator runs on multiple processors

Horizontal Data Partitioning
- Relation R is split into P chunks R0, …, RP-1, stored at the P nodes
- Three ways to horizontally partition a relation:
  - Round robin: tuple ti goes to chunk (i mod P)
  - Hash-based partitioning on attribute A: tuple t goes to chunk h(t.A) mod P
  - Range-based partitioning on attribute A: tuple t goes to chunk i if vi-1 < t.A < vi

Horizontal Data Partitioning
- All three choices are just special cases: for each tuple, compute bin = f(t)
- Different properties of the function f determine hash vs. range vs. round robin vs. anything else

Parallel Selection
- Compute σA=v(R), or σv1<A<v2(R)
- On a conventional database: cost = B(R) (cost here ~ time, in the worst case)
- Q: What is the cost on a parallel database with P processors?
  - Round robin
  - Hash partitioned
  - Range partitioned
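The three horizontal partitioning schemes can be sketched as functions from a tuple to a chunk number in [0, P-1]. This is an illustrative sketch, not code from the course; the attribute name, range boundaries, and the use of Python's built-in `hash()` as a stand-in for h are assumptions:

```python
# Three ways to assign a tuple to one of P chunks.

P = 4  # number of nodes (hypothetical)

def round_robin(i, t):
    # Tuple t_i goes to chunk (i mod P); ignores the tuple's contents.
    return i % P

def hash_partition(t, attr="A"):
    # Tuple t goes to chunk h(t.A) mod P. Python's hash() stands in for h;
    # a real system would use a stable, explicit hash function.
    return hash(t[attr]) % P

def range_partition(t, boundaries=(10, 20, 30), attr="A"):
    # Tuple t goes to chunk i if v_{i-1} < t.A <= v_i, for hypothetical
    # boundaries v1=10, v2=20, v3=30.
    for i, v in enumerate(boundaries):
        if t[attr] <= v:
            return i
    return len(boundaries)

t = {"A": 17}
print(round_robin(5, t))   # chunk 1
print(range_partition(t))  # chunk 1, since 10 < 17 <= 20
```

All three are indeed special cases of "compute bin = f(t)": f ignores t for round robin, hashes one attribute for hash partitioning, and compares one attribute against boundaries for range partitioning.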
Parallel Selection: Cost
- Q: What is the cost on a parallel database with P processors? (cost here ~ time, in the worst case)
- σ: B(R) / P in all cases
- However, different processors do the work:
  - Round robin: all servers do the work
  - Hash: one server for σA=v(R), all servers for σv1<A<v2(R)
  - Range: one server only

Data Partitioning Revisited
What are the pros and cons?
- Round robin: good load balance, but always needs to read all the data
- Hash-based partitioning: good load balance, but works only for equality predicates and full scans
- Range-based partitioning: works well for range predicates, but can suffer from data skew

Parallel Group By: γA, sum(B)(R)
- Step 1: server i partitions chunk Ri using a hash function h(t.A) mod P: Ri0, Ri1, …, Ri,P-1
- Step 2: server i sends partition Rij to server j
- Step 3: server j computes γA, sum(B) on R0j, R1j, …, RP-1,j

Parallel Group By: γA, sum(B)(R): Cost
- Recall that the conventional cost = 3 B(R)
- Cost of Step 1: B(R)/P I/O operations
- Cost of Step 2: (P-1)/P B(R) blocks are sent, at no cost (network costs are assumed to be much lower than I/O costs)
- Cost of Step 3: 2 B(R)/P. Why? When can we reduce it to 0?
- Total = 3 B(R) / P (communication costs ignored)

Parallel Group By: γA, sum(B)(R)
Can we do better? Sum? Count? Avg? Max? Median?

Parallel Group By: γA, sum(B)(R)
- Sum(B) = Sum(B0) + Sum(B1) + … + Sum(Bn)  (distributive)
- Count(B) = Count(B0) + Count(B1) + … + Count(Bn)  (distributive)
- Max(B) = Max(Max(B0), Max(B1), …, Max(Bn))  (distributive)
- Avg(B) = Sum(B) / Count(B)  (algebraic)
- Median(B) = ?  (holistic)
- The names "distributive" (a bad name), "algebraic", and "holistic" are from [Gray et al.: DataCube, ICDE 1996]

Parallel Join: R ⋈A=B S
- Step 1:
  - For all servers i in [0,k], server i partitions chunk Ri using a hash function h(t.A) mod P: Ri0, Ri1, …, Ri,P-1
  - For all servers j in [k+1,P], server j partitions chunk Sj using a hash function h(t.B) mod P: Sj0, Sj1, …, Sj,P-1
- Step 2:
  - Server i sends partition Riu to server u
  - Server j sends partition Sju to server u
- Step 3: server u computes the join of the partitions Riu and Sju it received
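The three-step repartitioned join can be simulated in a single process, with in-memory lists standing in for the network. This is a sketch under assumed inputs (small relations of (key, payload) tuples), not the course's implementation:

```python
# Minimal simulation of a hash-partitioned parallel join.
# Tuples of R and S with equal join-key hashes meet at the same server,
# which then joins only the tuples it received.
from collections import defaultdict

P = 3  # number of servers (hypothetical)

def h(key):
    return key % P  # stand-in for the hash function h

def parallel_hash_join(r_chunks, s_chunks):
    # Steps 1 + 2: each server partitions its chunk by h(join key) and
    # "sends" partition u to server u (here: appends to server u's list).
    r_at, s_at = defaultdict(list), defaultdict(list)
    for chunk in r_chunks:            # chunks of R, tuples (a, payload)
        for t in chunk:
            r_at[h(t[0])].append(t)
    for chunk in s_chunks:            # chunks of S, tuples (b, payload)
        for t in chunk:
            s_at[h(t[0])].append(t)
    # Step 3: server u joins the R tuples and S tuples it received.
    out = []
    for u in range(P):
        for (a, x) in r_at[u]:
            for (b, y) in s_at[u]:
                if a == b:
                    out.append((a, x, y))
    return sorted(out)

r = [[(1, "r1"), (4, "r4")], [(2, "r2")]]
s = [[(1, "s1")], [(4, "s4"), (5, "s5")]]
print(parallel_hash_join(r, s))  # [(1, 'r1', 's1'), (4, 'r4', 's4')]
```

The key point the simulation makes visible: after repartitioning, no server ever needs to see another server's non-matching tuples, which is why each join in Step 3 is local.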
Parallel Join: R ⋈A=B S: Cost
- Step 1: (B(R) + B(S)) / P
- Step 2: 0
  - (P-1)/P (B(R) + B(S)) blocks are sent, but we assume network costs to be << disk I/O costs
- Step 3:
  - 0 if the smaller table fits in main memory: B(S)/P <= M
  - 4 (B(R) + B(S)) / P otherwise
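The per-server cost formula above can be plugged into code as a sanity check. The block counts, server count, and memory size below are hypothetical:

```python
# Per-server I/O cost of the hash-partitioned parallel join,
# following the three-step cost breakdown (network cost ignored).

def parallel_join_cost(B_R, B_S, P, M):
    """B_R, B_S: blocks of R and S; P: servers; M: memory blocks per server."""
    step1 = (B_R + B_S) / P            # read and hash-partition both relations
    step2 = 0                          # shipping blocks: assumed << disk I/O
    if B_S / P <= M:                   # smaller relation's partition fits in memory
        step3 = 0                      # join on the fly as tuples arrive
    else:
        step3 = 4 * (B_R + B_S) / P    # two-pass join of the received partitions
    return step1 + step2 + step3

# Hypothetical: B(R)=1000, B(S)=200, P=10 servers, M=50 blocks each.
print(parallel_join_cost(1000, 200, 10, 50))  # 120.0 (S's partitions fit in memory)
# Same sizes, but only M=10 blocks of memory per server.
print(parallel_join_cost(1000, 200, 10, 10))  # 600.0 (120 + 480 for the two-pass join)
```

Note how the memory threshold dominates the outcome: once B(S)/P fits in M, the join costs a single parallel scan of both relations.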