CORNELL CS 632 - Computing the cube


Outline: Computing the cube; On the Computation of Multidimensional Aggregates; Motivation; What is a CUBE?; Approaches; Possible optimizations; Sort based methods; PipeSort; Search lattice: CUBE-BY ABCD; Search lattice (contd...); PipeSort (contd...); Min cost matching; Algorithm PipeSort; Example plan; Tweaks; Hash based methods; PipeHash; PipeHash: Overview; Slide 19; Heuristics; Algorithm; Select_subtree(T); Compute_subtree(T'); Notes on PipeHash; Overlap method; Sorted runs; Partitions; Overview; Overview (contd...); Choosing parent cuboids; Example cuboid tree; Choosing overlapping cuboids; Slide 33; Cuboid computation; Cuboid computation (contd...); Example CUBE computation; An array based algorithm for simultaneous multidimensional aggregates; ROLAP vs MOLAP; MOLAP systems; Problem; Efficient array storage: Issues; Chunking; Compressing sparse arrays; Loading arrays from tables; Basic algo (No overlapping); Generalization; Basic array cubing algorithm; Example; 3-D array (Dim order ABC); Multi-Way algorithm; Memory requirements; Slide 52; Minimum Memory Spanning Tree; MMST; Optimal dimension order; PowerPoint Presentation; Multi-pass Multi-way Array Algorithm; Heuristic; Algorithm; ROLAP vs MOLAP

Slide 1: Computing the cube
- Abhinandan Das, CS 632, Mar 8 2001

Slide 2: On the Computation of Multidimensional Aggregates
- Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey Naughton, Raghu Ramakrishnan & Sunita Sarawagi -- VLDB 1996

Slide 3: Motivation
- OLAP / multidimensional data analysis
- Eg: Transactions(Prod, Date, StoreId, Cust, Sales); sum of sales by (P, SId); (P); (P, D, SId)
- Computing multidimensional aggregates is a performance bottleneck
- Goal: efficient computation of several related group-bys

Slide 4: What is a CUBE?
- n-dimensional generalization of the group-by operator
- Group-bys corresponding to all possible subsets of a given set of attributes
- Eg: SELECT P, D, C, Sum(Sales) FROM Transactions CUBE-BY P, D, C
- Computes the group-bys ALL, P, D, C, PD, PC, DC, PDC

Slide 5: Approaches
- Basic group-by methods: sort based, hash based
- Naïve approach

Slide 6: Possible optimizations
1. Smallest parent
2. Cache results
3. Amortize scans
4. Share sorts
5. Share partitions
- Often contradictory
- Assumption: distributive aggregate functions (sum, count, min, max; average) -- non-distributive: median

Slide 7: Sort based methods
- Algorithm PipeSort
- Share-sorts vs smallest parent: optimize to get minimum total cost
- Cache-results & amortize-scans via pipelining: ABCD -> ABC -> AB -> A

Slide 8: PipeSort
- Assumption: we have an estimate of the number of distinct values for each group-by
- Input: search lattice -- a graph where each vertex represents a group-by of the cube
- Edge i -> j if |i| = |j| + 1 and j is a subset of i

Slide 9: Search lattice: CUBE-BY ABCD (figure)

Slide 10: Search lattice (contd...)
- Each edge e_ij is associated with two costs:
  - S(e_ij): cost of computing j from i when i is pre-sorted
  - U(e_ij): cost of computing j from i when i is not already sorted
- Idea: if the attribute order of a group-by j is a prefix of its parent i, compute j without sorting (cost S(e_ij)); else first sort (cost U(e_ij))

Slide 11: PipeSort (contd...)
- Proceed level by level, k = 0 to k = N-1
- Find the best way of computing level k from level k+1
- Weighted bipartite matching:
  - Make k additional replicas of each node at level k+1
  - Cost S(e_ij) on the original node, U(e_ij) on the replicas
  - Find a minimum cost maximal matching

Slide 12: Min cost matching (figure)

Slide 13: Algorithm PipeSort
- For level k = 0 to N-1:
  - Generate_plan(k+1 -> k)
  - Fix the sort order of the level k+1 nodes
- Generate_plan(k+1 -> k):
  - Make k additional replicas of the level k+1 nodes, assign appropriate edge costs
  - Find a min-cost matching

Slide 14: Example plan (figure)

Slide 15: Tweaks
- Aggregate and remove duplicates whilst sorting
- Use a partial sorting order to reduce sort costs, e.g. ABC -> AC
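To make the pipelining on slides 7 and 13 concrete (ABCD -> ABC -> AB -> A), the following is a minimal Python sketch, not part of the slides: once the input is sorted on all dimensions, a single scan computes a distributive SUM for every prefix group-by, with each finished group at one level fed to the level below. The function name `pipeline` and the row layout (dimension values followed by one measure) are assumptions made for illustration.

def pipeline(sorted_rows, n):
    """sorted_rows: tuples (d1, ..., dn, measure), sorted on d1..dn.
    Returns results[k] = list of (key, sum) for the k-attribute prefix
    group-by, k = 0 (ALL) .. n, all computed in one scan."""
    current = [None] * (n + 1)              # open (key, running_sum) per level
    results = [[] for _ in range(n + 1)]

    def emit(k, key, total):
        # A group at level k is finished: record it and pipe it one level down.
        results[k].append((key, total))
        if k > 0:
            absorb(k - 1, key[:k - 1], total)

    def absorb(k, key, value):
        # Fold `value` into the open group at level k, closing it on key change.
        if current[k] is None or current[k][0] == key:
            prev = current[k][1] if current[k] else 0
            current[k] = (key, prev + value)
        else:
            old_key, old_total = current[k]
            current[k] = (key, value)
            emit(k, old_key, old_total)

    for row in sorted_rows:
        absorb(n, tuple(row[:n]), row[n])
    for k in range(n, -1, -1):              # flush the still-open groups
        if current[k] is not None:
            key, total = current[k]
            current[k] = None
            emit(k, key, total)
    return results

rows = [("a", 1, 2, 10), ("a", 1, 3, 5), ("a", 2, 2, 7), ("b", 1, 3, 4)]
res = pipeline(rows, 3)
print(res[1])   # [(('a',), 22), (('b',), 4)]   -- group-by on the first attribute
print(res[0])   # [((), 26)]                    -- ALL

This sketch covers only one pipeline for a single sort order; PipeSort's matching step (slides 11-13) decides which group-bys are attached to which pipelines and in what sort order.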
costsEg: ABC  AC16Hash based methodsAlgorithm PipeHashCan include all stated optimizations: (If memory available)For k=N downto 0For each group-by g at level k+1Compute in 1 scan of g all children for which g is smallest parentSave g and deallocate hash table of g17PipeHashLimited memory  Use partitioningOptimization share-partitions:When data is partitioned on attribute A, all group-bys containing A can be computed independently on each partition No recombination required18PipeHash: OverviewFirst choose smallest parent for each group-by (gives MST)Optimize for memory constraints:Decide what group-bys to compute togetherChoice of attribute for data partitioningMinimizing overall disk scan cost: NP-Hard!19PipeHash20HeuristicsOptimizations cache-results and amortize-scans favoured by choosing large subtrees of MST: Can compute multiple group-bys togetherHowever, partitioning attribute limits subtreeHence choose the partitioning attribute that allows choice of largest subtree21Algorithm:Worklist w=MSTWhile w not emptyChoose any tree T from wT’ = select_subtree(T)Compute_subtree(T’)22Select_subtree(T)If mem reqd by T < M, return TElse: For any get subtree TaLet Pa=max # of partitions of root(T) possible if a used for partitioningChoose a s.t. (mem reqd Ta)/Pa<M and Ta is largest subtree over all aAdd forest T-Ta to w, return TaAa 23Compute_subtree(T’)numParts = (mem reqd T’)* fudge_factor/MPartition root(T’) into numPartsFor each partition of root(T’)For each node n in T’ (breadth first)Compute all children of n in 1 scanRelease memory occupied by hash table of n24Notes on PipeHashPipeHash biased towards smallest-parent optimizationEg: Compute BC from BCD (fig)In practice, saving on sequential disk scans less important than reducing CPU cost of aggregation by choosing smallest parent!25Overlap methodSort basedMinimize disk accesses by overlapping computation of “cuboids”Focus: Exploit partially matching sort orders to reduce sorting costsUses smallest parent optimization26Sorted runs B = (A1,A2,...Aj) ; S=(A1,...Al-1,Al+1,...Aj)A sorted runsorted run of S in B is a maximal run of tuples in B whose ordering is consistent with the sort order in SEg:B=[(a,1,2),(a,1,3),(a,2,2),(b,1,3), (b,3,2),(c,3,1)] S=[(a,2),(a,3),(b,3),(b,2),(c,1)] (1st & 3rd)Sorted runs for S: [(a,2),(a,3)],[(a,2)],[(b,3)],[(b,2)] and [(c,1)]27PartitionsB, S have common prefix A1,...,Al-1A partition partition of a cuboid S in B is the union of sorted runs s.t. the first (l-1) columns (ie common prefix) have the same valuePrevious eg: Partitions for S in B are: [(a,2),(a,3)], [(b,3),(b,2)] & [(c,1)]Tuples from different

