Introduction to Database Systems CSE 444 Lecture 20: Operator Algorithms Magda Balazinska - CSE 444, Fall 2010 1Why Learn About Op Algos? • Implemented in commercial DBMSs – DBMSs implement different subsets of known algorithms • Good algorithms can greatly improve performance • Need to know about physical operators to understand query optimization Magda Balazinska - CSE 444, Fall 2010 2Magda Balazinska - CSE 444, Fall 2010 Cost Parameters • In database systems the data is on disk • Cost = total number of I/Os • Parameters: – B(R) = # of blocks (i.e., pages) for relation R – T(R) = # of tuples in relation R – V(R, a) = # of distinct values of attribute a • When a is a key, V(R,a) = T(R) • When a is not a key, V(R,a) can be anything < T(R) 3Magda Balazinska - CSE 444, Fall 2010 Cost • Cost of an operation = number of disk I/Os to – Read the operands – Compute the result • Cost of writing the result to disk is not included – Need to count it separately when applicable 4Magda Balazinska - CSE 444, Fall 2010 Cost of Scanning a Table • Result may be unsorted: B(R) • Result needs to be sorted: 3B(R) – We will discuss sorting later 5Magda Balazinska - CSE 444, Fall 2010 Outline for Today • Join operator algorithms – One-pass algorithms (Sec. 15.2 and 15.3) – Index-based algorithms (Sec 15.6) – Two-pass algorithms (Sec 15.4 and 15.5) – Note about readings: • In class, we will discuss only algorithms for join operator (because other operators are easier) • Read the book to get more details about these algos • Read the book to learn about algos for other operators 6Magda Balazinska - CSE 444, Fall 2010 Basic Join Algorithms • Logical operator: – Product(pname, cname) ⋈ Company(cname, city) • Propose three physical operators for the join, assuming the tables are in main memory: – Hash join – Nested loop join – Sort-merge join 7Magda Balazinska - CSE 444, Fall 2010 Hash Join Hash join: R ⋈ S • Scan R, build buckets in main memory • Then scan S and join • Cost: B(R) + B(S) • One-pass algorithm when B(R) <= M – By “one pass”, we mean that the operator reads its operands only once. It does not write intermediate results back to disk. 8Hash Join Example 9 Patient Insurance Patient(pid, name, address) Insurance(pid, provider, policy_nb) 1 ‘Bob’ ‘Seattle’ 2 ‘Ela’ ‘Everett’ 3 ‘Jill’ ‘Kent’ 4 ‘Joe’ ‘Seattle’ Patient 2 ‘Blue’ 123 4 ‘Prem’ 432 Insurance 4 ‘Prem’ 343 3 ‘GrpH’ 554 Two tuples per pageHash Join Example 10 Patient Insurance 1 2 3 4 Patient 2 4 Insurance 4 3 Showing pid only 8 5 9 6 2 8 8 9 6 6 1 3 Disk Memory M = 21 pagesHash Join Example 11 Step 1: Scan Patient and create hash table in memory 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Memory M = 21 pages Hash h: pid % 5 Input buffer 1 2 4 3 9 6 8 5 1 2Hash Join Example 12 Step 2: Scan Insurance and probe into hash table 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Memory M = 21 pages Hash h: pid % 5 Input buffer 1 2 4 3 9 6 8 5 1 2 2 4 Output buffer 2 2 Write to diskHash Join Example 13 Step 2: Scan Insurance and probe into hash table 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Memory M = 21 pages Hash h: pid % 5 Input buffer 1 2 4 3 9 6 8 5 1 2 2 4 Output buffer 4 4Hash Join Example 14 Step 2: Scan Insurance and probe into hash table 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Memory M = 21 pages Hash h: pid % 5 Input buffer 1 2 4 3 9 6 8 5 1 2 4 3 Output buffer 4 4 Keep going until read all of Insurance Cost: B(R) + B(S)Hash Join Details 15 Open( ) { H = newHashTable( ); S.Open( ); x = S.GetNext( ); while (x != null) { H.insert(x); x = S.GetNext( ); } S.Close( ); R.Open( ); buffer = [ ]; }Hash Join Details 16 GetNext( ) { while (buffer == [ ]) { x = R.GetNext( ); if (x==Null) return NULL; buffer = H.find(x); } z = buffer.first( ); buffer = buffer.rest( ); return z; }Hash Join Details 17 Close( ) { release memory (H, buffer, etc.); R.Close( ) } Magda Balazinska - CSE 444, Fall 2010Magda Balazinska - CSE 444, Fall 2010 Nested Loop Joins • Tuple-based nested loop R ⋈ S • R is the outer relation, S is the inner relation • Cost: B(R) + T(R) B(S) • Not quite one-pass since S is read many times for each tuple r in R do for each tuple s in S do if r and s join then output (r,s) 18Magda Balazinska - CSE 444, Fall 2010 Page-at-a-time Refinement • Cost: B(R) + B(R)B(S) for each page of tuples r in R do for each page of tuples s in S do for all pairs of tuples if r and s join then output (r,s) 191 2 Nested Loop Example 20 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Input buffer for Patient Output buffer 2 2 Input buffer for Insurance 2 4Nested Loop Example 21 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Input buffer for Patient 1 2 Output buffer Input buffer for Insurance 4 3 1 2Nested Loop Example 22 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Input buffer for Patient 1 2 Output buffer Input buffer for Insurance 2 8 1 2 2 2 Cost: B(R) + B(R)B(S) Keep going until read all of Insurance Then repeat for next page of Patient… until end of PatientMagda Balazinska - CSE 444, Fall 2010 Sort-Merge Join Sort-merge join: R ⋈ S • Scan R and sort in main memory • Scan S and sort in main memory • Merge R and S • Cost: B(R) + B(S) • One pass algorithm when B(S) + B(R) <= M • Typically, this is NOT a one pass algorithm 23Sort-Merge Join Example 24 1 2 3 4 Patient 2 4 Insurance 4 3 8 5 9 6 2 8 8 9 6 6 1 3 Disk Memory M = 21 pages 1 2 4 3 9 6 8 5 Step 1: Scan Patient and sort in memorySort-Merge Join Example 25 …
View Full Document