1. Computing the cube
Abhinandan Das, CS 632, Mar 8 2001

2. On the Computation of Multidimensional Aggregates
Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey Naughton, Raghu Ramakrishnan & Sunita Sarawagi -- VLDB 1996

3. Motivation
- OLAP / multidimensional data analysis
- Eg: Transactions(Prod, Date, StoreId, Cust, Sales)
  Sum of sales by: (P,SId); (P); (P,D,SId)
- Computing multidimensional aggregates is a performance bottleneck
- Goal: efficient computation of several related group-bys

4. What is a CUBE?
- n-dimensional generalization of the group-by operator
- Group-bys corresponding to all possible subsets of a given set of attributes
- Eg: SELECT P, D, C, Sum(Sales) FROM Transactions CUBE-BY P, D, C
  computes the group-bys ALL, P, D, C, PD, PC, DC, PDC

5. Approaches
- Basic group-by methods: sort based, hash based
- Naïve approach

6. Possible optimizations
1. Smallest parent
2. Cache results
3. Amortize scans
4. Share sorts
5. Share partitions
- These optimizations are often contradictory
- Assumption: distributive aggregate functions -- sum, count, min, max; average
  (Non-distributive: median)

7. Sort based methods
- Algorithm PipeSort
- Share-sorts vs smallest parent: optimize to get the minimum total cost
- Cache-results & amortize-scans via pipelining: ABCD -> ABC -> AB -> A

8. PipeSort
- Assumption: have an estimate of the number of distinct values for each group-by
- Input: search lattice
  - Graph where each vertex represents a group-by of the cube
  - Edge i -> j if |i| = |j| + 1 and j is a subset of i

9. Search lattice: CUBE-BY ABCD (figure)

10. Search lattice (contd.)
- Each edge e_ij is associated with two costs:
  - S(e_ij): cost of computing j from i when i is pre-sorted
  - U(e_ij): cost of computing j from i when i is not already sorted
- Idea: if the attribute order of group-by j is a prefix of its parent i, compute j without sorting (cost S(e_ij)); else sort first (cost U(e_ij))

11. PipeSort (contd.)
- Proceed level by level, k = 0 to N-1
- Find the best way of computing level k from level k+1
- Weighted bipartite matching:
  - Make k additional replicas of each node at level k+1
  - Cost S(e_ij) on the original node, U(e_ij) on the replicas
  - Find a minimum cost maximal matching

12. Min cost matching (figure)

13. Algorithm PipeSort
For level k = 0 to N-1:
  Generate_plan(k+1 -> k)
  Fix the sort order of the level k+1 nodes
Generate_plan(k+1 -> k):
  Make k additional replicas of the level k+1 nodes; assign the appropriate edge costs
  Find a min-cost matching

14. Example plan (figure)

15. Tweaks
- Aggregate and remove duplicates while sorting
- Use partial sorting orders to reduce sort costs, eg: ABC -> AC

16. Hash based methods
- Algorithm PipeHash
- Can include all the stated optimizations (if enough memory is available):
  For k = N downto 0:
    For each group-by g at level k+1:
      Compute in one scan of g all children for which g is the smallest parent
      Save g and deallocate the hash table of g

17. PipeHash
- Limited memory => use partitioning
- Optimization share-partitions: when the data is partitioned on attribute A, all group-bys containing A can be computed independently on each partition; no recombination required
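The pipelining idea from slides 7 and 13 (a single scan of sorted input feeding ABCD -> ABC -> AB -> A -> ALL) can be sketched in Python. This is a minimal illustration, not code from the paper: the function name is my own, and sum stands in for any distributive aggregate. The point is that a sorted scan needs only one open group (key plus running total) per level, instead of one hash table per group-by:

```python
def pipeline_groupbys(sorted_rows, n):
    """One scan of rows sorted on (A1,...,An) computes every prefix
    group-by A1..Ak, k = n down to 0 (ALL), keeping a single open
    group (key + running sum) per level."""
    cur = [()] + [None] * n          # open group key at each level k
    acc = [0] * (n + 1)              # running sum for that group
    out = {k: {} for k in range(n + 1)}

    def close(k):                    # emit finished groups at levels k..n
        for j in range(k, n + 1):
            if cur[j] is not None:
                out[j][cur[j]] = acc[j]
            cur[j], acc[j] = None, 0

    for *dims, measure in sorted_rows:
        k = 1                        # first level whose prefix changed
        while k <= n and cur[k] == tuple(dims[:k]):
            k += 1
        close(k)                     # prefixes of length k..n are done
        for j in range(k, n + 1):    # open the new groups
            cur[j] = tuple(dims[:j])
        for j in range(n + 1):
            acc[j] += measure
    close(0)                         # final flush, including the ALL group
    return out

# Rows already sorted on (A, B, C); the last field is the measure.
rows = [("a", "x", 1, 10), ("a", "x", 2, 5), ("a", "y", 1, 7), ("b", "x", 1, 3)]
out = pipeline_groupbys(rows, 3)
# out[3][("a", "x", 2)] == 5, out[2][("a", "x")] == 15,
# out[1][("a",)] == 22, out[0][()] == 25
```

This is why a pipeline costs only one scan of its root: every group-by in the chain is a prefix of the sort order, so each finished group can be emitted the moment its prefix changes.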
18. PipeHash: Overview
- First choose the smallest parent for each group-by (gives a minimum spanning tree, MST)
- Optimize for memory constraints:
  - Decide which group-bys to compute together
  - Choose the attribute for data partitioning
- Minimizing the overall disk scan cost is NP-hard!

19. PipeHash (figure)

20. Heuristics
- The cache-results and amortize-scans optimizations are favoured by choosing large subtrees of the MST: multiple group-bys can be computed together
- However, the partitioning attribute limits the subtree
- Hence choose the partitioning attribute that allows the choice of the largest subtree

21. Algorithm
Worklist w = MST
While w is not empty:
  Choose any tree T from w
  T' = Select_subtree(T)
  Compute_subtree(T')

22. Select_subtree(T)
If the memory required by T < M, return T
Else, for each attribute a, get the subtree Ta of group-bys containing a:
  Let Pa = max # of partitions of root(T) possible if a is used for partitioning
  Choose a s.t. (memory required by Ta)/Pa < M and Ta is the largest such subtree over all a
  Add the forest T - Ta to w; return Ta

23. Compute_subtree(T')
numParts = (memory required by T') * fudge_factor / M
Partition root(T') into numParts partitions
For each partition of root(T'):
  For each node n in T' (breadth first):
    Compute all children of n in one scan
    Release the memory occupied by the hash table of n

24. Notes on PipeHash
- PipeHash is biased towards the smallest-parent optimization
  Eg: compute BC from BCD (fig)
- In practice, saving on sequential disk scans is less important than reducing the CPU cost of aggregation by choosing the smallest parent!

25. Overlap method
- Sort based
- Minimize disk accesses by overlapping the computation of "cuboids"
- Focus: exploit partially matching sort orders to reduce sorting costs
- Uses the smallest-parent optimization

26. Sorted runs
- B = (A1,A2,...,Aj); S = (A1,...,A(l-1),A(l+1),...,Aj)
- A sorted run of S in B is a maximal run of tuples in B whose ordering is consistent with the sort order in S
- Eg: B = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)]
  S = [(a,2),(a,3),(b,3),(b,2),(c,1)] (1st & 3rd attributes)
  Sorted runs for S: [(a,2),(a,3)], [(a,2)], [(b,3)], [(b,2)] and [(c,1)]

27. Partitions
- B and S have the common prefix A1,...,A(l-1)
- A partition of a cuboid S in B is the union of sorted runs s.t.
the first (l-1) columns (ie the common prefix) have the same value
- Previous eg: the partitions for S in B are [(a,2),(a,3)], [(b,3),(b,2)] & [(c,1)]
- Tuples from different partitions never combine into the same output tuple, so each partition can be aggregated independently
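Slide 26's example can be reproduced with a short Python sketch. The helper name is hypothetical, and the rule used to delimit runs (break wherever the first l attributes of B change, which keeps each projected run sorted) is my reading of the definition; it matches the slide's example:

```python
from itertools import groupby

def sorted_runs(B, l, keep):
    """Split B (sorted on A1..Aj) into groups sharing their first l
    attributes, then project each group onto the positions in `keep`.
    Within a group the projection is still sorted, so each group is a
    sorted run of S in B."""
    return [[tuple(t[i] for i in keep) for t in grp]
            for _, grp in groupby(B, key=lambda t: t[:l])]

B = [("a",1,2), ("a",1,3), ("a",2,2), ("b",1,3), ("b",3,2), ("c",3,1)]
l = 2                                    # A2 (position l) is the dropped attribute
runs = sorted_runs(B, l, keep=(0, 2))    # S = (A1, A3)
# runs == [[("a",2),("a",3)], [("a",2)], [("b",3)], [("b",2)], [("c",1)]]

# Partitions (slide 27): merge runs whose common prefix A1..A(l-1) agrees.
parts = [[t for run in grp for t in run]
         for _, grp in groupby(runs, key=lambda run: run[0][:l-1])]
# parts == [[("a",2),("a",3),("a",2)], [("b",3),("b",2)], [("c",1)]]
```

The slide lists the first partition as [(a,2),(a,3)] because the two (a,2) tuples aggregate into one when the partition is computed.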