TEMPLE CIS 664 - A Model-Based Semantic Compression System for Massive Data Tables


Slide 1: Title
CIS664 KD&DM
SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables
S. Babu, M. Garofalakis, R. Rastogi
Presented by Uroš Midić, Apr 11, 2007

Slide 2: Outline
• Introduction
• Definitions
• Fascicles compression method
• Model-based semantic compression
• SPARTAN
• Experimental results
• Results

Slide 3: Introduction
• Massive amounts of data are produced daily in various environments:
  - Corporate: communications traffic, banking
  - Scientific: high-throughput experiments, sensor networks, satellites
  - Governmental: surveys
• Compression reduces the space needed to store the data and the bandwidth needed to transfer it.

Slide 4: Introduction
• Many statistical and dictionary-based lossless compression methods have been developed and proven (asymptotically) optimal with respect to various criteria.
• Very effective lossy compression methods exist for specific applications, notably multimedia: pictures, sound, movies.

Slide 5: Introduction
• For large databases, traditional syntactic compression approaches are usually not very effective in the general case.
• However, tables often contain semantic patterns that syntactic approaches do not exploit.
• Given the nature of many data-analysis applications, we can afford lossy compression, as long as it guarantees an upper bound on the compression error.

Slide 6: Definitions
• T is an n-attribute table.
• X = {X1, ..., Xn} is the set of n attributes of T, and dom(Xi) is the domain of Xi.
• Tc is the compressed version of T.
• |Tc| and |T| denote the storage-space requirements (in bytes) of Tc and T.
• e = [e1, ..., en] defines the per-attribute acceptable degree of information loss.

Slide 7: Definitions
• For a numeric attribute Xi, ei is an upper bound on the absolute difference between the actual value of Xi in T and its approximate value in Tc.
• For a categorical attribute Xi, ei is an upper bound on the probability that the approximate value of Xi in Tc differs from the actual value in T.

Slide 8: Fascicles compression method
• The Fascicles method searches for groups of rows in the table (fascicles) that have similar values on some attributes. Within a fascicle, the values of those attributes can be approximated by a single representative.
• Example, with e = [2, 5000, 25000, 0]:

  Original:
    30   90,000  200,000  Good
    50  110,000  250,000  Good
    70   35,000  125,000  Poor
    75   15,000  100,000  Poor

  Compressed (rows 1-2 and rows 3-4 form fascicles on the last two attributes):
    30   90,000  }
    50  110,000  }  225,000  Good
    70   35,000  }
    75   15,000  }  112,500  Poor
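To make the tolerance semantics of slides 6-8 concrete, here is a minimal Python sketch (illustrative code, not the Fascicles implementation; the function name and the midrange/majority choices are assumptions): for a group of rows, it tries to find one representative value for an attribute that respects the numeric tolerance (maximum absolute error) or the categorical tolerance (mismatch probability).

  def fascicle_representative(rows, col, tol, numeric=True):
      """Return one value representing column `col` of all `rows` within
      tolerance `tol`, or None if no such representative exists."""
      values = [r[col] for r in rows]
      if numeric:
          lo, hi = min(values), max(values)
          # The midrange minimizes the worst-case absolute error.
          return (lo + hi) / 2 if (hi - lo) / 2 <= tol else None
      # Categorical: the most frequent value minimizes the fraction of
      # mismatching rows, which must stay within probability tol.
      best = max(set(values), key=values.count)
      return best if 1 - values.count(best) / len(values) <= tol else None

  # The first fascicle from slide 8, with tolerances e3 = 25,000 and e4 = 0.
  rows = [(30, 90_000, 200_000, "Good"), (50, 110_000, 250_000, "Good")]
  print(fascicle_representative(rows, 2, 25_000))            # 225000.0
  print(fascicle_representative(rows, 3, 0, numeric=False))  # Good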
Slide 9: Model-based semantic compression
Given a table T and a vector of tolerances e, find a subset {X1, ..., Xp} of X and a set of models {M1, ..., Mp} such that model Mi predicts the values of attribute Xi from the attributes in X \ {X1, ..., Xp}, without exceeding the error bound ei.
The models Mi come from a predefined family (e.g. CaRTs: classification and regression trees).

Slide 10: Model-based semantic compression
Then Tc = <T', {M1, ..., Mp}>, where T' is the projection of T onto X \ {X1, ..., Xp}, is a compressed version of T (a decompression sketch follows slide 18).
The problem is to find {X1, ..., Xp} and {M1, ..., Mp} such that the storage requirement |Tc| of the compressed table is minimized.

Slide 11: SPARTAN
Overview [architecture diagram of the SPARTAN system].

Slide 12: SPARTAN - DependencyFinder
• Finds the structure of a Bayesian network (BN) from the data. There are two families of such methods:
  - constraint-based methods
  - scoring-based methods
• SPARTAN uses a constraint-based algorithm that relies on pairwise conditional-independence (CI) tests based on mutual-information divergence. It requires at most O(n^4) CI tests (an illustrative computation follows slide 18).
• Each CI test requires one pass over the data; therefore only a sample of table T is used.

Slide 13: SPARTAN - CaRTSelector
• The problem of choosing a storage-optimal subset of attributes Xpred to be predicted from the attributes X \ Xpred is NP-hard.
• SPARTAN uses two heuristic approaches to select Xpred. Both rely on the Bayesian network provided by the DependencyFinder.

Slide 14: CaRTSelector - Greedy approach
The greedy selection algorithm visits the attributes in the order imposed by the Bayesian network (a sketch follows slide 18):
1. If Xi has no parent nodes in the BN, it is placed in Xmat (the set of attributes to be materialized).
2. If Xi has parent nodes, CaRTBuilder is invoked to construct a CaRT-based predictor for Xi (within the specified error tolerance ei) from the attributes currently in Xmat. Xi is chosen for prediction if the estimated storage benefit is at least θ (an input parameter), and materialized otherwise.

Slide 15: CaRTSelector - Greedy approach
+ Greedy constructs at most n-1 CaRT predictors.
- The decision whether Xi is predicted or materialized is based solely on its "localized" prediction benefit.
- It also hinges on the parameter θ:
  • a high θ may leave too few attributes chosen for prediction;
  • a low θ may let low-benefit predictors be chosen early, which can exclude higher-benefit predictors later.

Slide 16: CaRTSelector - MaxIndependentSet selector
(This loop is sketched after slide 18.)
1. Set Xmat = X.
2. For each Xi in Xmat, build a CaRT for Xi from its "predictive neighborhood" and estimate the storage-cost benefit of predicting Xi.
3. Build a node-weighted undirected graph Gtemp on X with:
  • an edge (Xi, Y) for every attribute Y used in the predictor for Xi;
  • the weight of each node Xi set to the storage-cost benefit of predicting Xi.
4. Run a heuristic for the Weighted Maximum Independent Set (WMIS) problem on Gtemp.
5. If the sum of storage-cost benefits over the attributes in the resulting independent set S is negative, stop.
6. Otherwise, move the nodes of S from Xmat to Xpred and go to step 2.

Slide 17: CaRTSelector - MaxIndependentSet selector
We look for a Weighted Maximum Independent Set in Gtemp because:
1. we want to maximize the total storage-cost benefit (the sum of the selected node weights), and
2. we must not select two nodes connected by an edge, since one of them would be needed to predict the other.
Note that step 2 must take into account the possibly negative effect of moving Xi to Xpred: Xi may have previously been assigned as a predictor attribute for some Xj, so its benefit may be negative.

Slide 18: CaRTSelector - MaxIndependentSet selector
• This heuristic builds at most O(rn^2) predictors and runs the WMIS heuristic at most O(n) times, where r is an upper bound on the number of predictor attributes used in any of the CaRTs.
• Under very loose assumptions, these bounds become O(rn log n) and O(log n), respectively.
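To make the Tc = <T', {M1, ..., Mp}> structure from slides 9-10 concrete, here is a minimal Python sketch (the names are assumptions, not the paper's code, and outlier handling for predicted values is omitted): the materialized attributes are read from T', and each predicted attribute is recomputed by applying its model to the materialized row.

  def decompress(t_prime, models):
      """t_prime: rows (dicts) holding only the materialized attributes.
      models: maps each predicted attribute to a function that predicts
      its value from a materialized row, within tolerance e_i."""
      table = []
      for row in t_prime:
          full = dict(row)
          for attr, model in models.items():
              full[attr] = model(row)
          table.append(full)
      return table

  # Toy usage with a hypothetical one-split predictor for 'salary'.
  t_prime = [{"age": 30}, {"age": 50}]
  models = {"salary": lambda r: 90_000 if r["age"] < 40 else 110_000}
  print(decompress(t_prime, models))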
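Slide 12's CI tests come down to mutual-information computations over a sample of T. Below is an illustrative Python computation of the pairwise (unconditional) statistic; the decision threshold and the exact form of the test inside SPARTAN's Bayesian-network builder are assumptions here.

  from collections import Counter
  from math import log2

  def mutual_information(xs, ys):
      """I(X;Y) in bits, estimated from two parallel sample columns."""
      n = len(xs)
      px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
      return sum(c / n * log2((c / n) / (px[x] * py[y] / n ** 2))
                 for (x, y), c in pxy.items())

  # Attributes are judged dependent when I(X;Y) exceeds a small threshold.
  print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0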
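The Greedy CaRTSelector from slide 14 can be sketched as follows (Python; the BN is reduced to a parent map, and build_cart / storage_benefit are placeholders for SPARTAN's CaRTBuilder and its storage-cost estimate, not real APIs):

  def greedy_select(order, parents, build_cart, storage_benefit, theta):
      """order: the attributes, listed consistently with the BN (parents first)."""
      x_mat, x_pred = [], []
      for xi in order:
          if not parents.get(xi):
              x_mat.append(xi)                # BN roots are always materialized
              continue
          cart = build_cart(xi, x_mat)        # predictor within tolerance e_i
          if storage_benefit(cart) >= theta:  # predict only if benefit >= theta
              x_pred.append(xi)
          else:
              x_mat.append(xi)
      return x_mat, x_pred

  # Toy run: C depends on A and B; pretend predicting C saves 10 units.
  print(greedy_select(["A", "B", "C"], {"C": ["A", "B"]},
                      build_cart=lambda xi, mat: (xi, tuple(mat)),
                      storage_benefit=lambda cart: 10, theta=5))
  # -> (['A', 'B'], ['C'])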
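Finally, a minimal Python sketch of the MaxIndependentSet loop from slide 16 (the benefit estimator and predictor neighborhoods are placeholder callables, and the WMIS step is a simple pick-heaviest-first heuristic, not necessarily the one used in the paper):

  def wmis_greedy(weights, edges):
      """Greedy heuristic for Weighted Maximum Independent Set."""
      chosen, blocked = [], set()
      for node in sorted(weights, key=weights.get, reverse=True):
          if node not in blocked:
              chosen.append(node)
              blocked.update(b if a == node else a
                             for a, b in edges if node in (a, b))
      return chosen

  def mis_select(attributes, predictor_attrs, estimate_benefit):
      x_mat, x_pred = set(attributes), set()
      while True:
          # Step 2: benefit of predicting each remaining attribute; it may
          # be negative if Xi already predicts some Xj (slide 17).
          weights = {xi: estimate_benefit(xi, x_mat, x_pred) for xi in x_mat}
          # Step 3: edge (Xi, Y) whenever Y is used in the predictor for Xi.
          edges = [(xi, y) for xi in x_mat for y in predictor_attrs(xi, x_mat)]
          s = wmis_greedy(weights, edges)                  # step 4
          if not s or sum(weights[n] for n in s) < 0:      # step 5
              return x_mat, x_pred
          x_mat -= set(s); x_pred |= set(s)                # step 6

  # Toy run: B and C are both predicted from A, so Gtemp has edges (B, A)
  # and (C, A); B and C are not adjacent and get picked together.
  preds = {"B": ["A"], "C": ["A"]}
  print(mis_select(["A", "B", "C"],
                   predictor_attrs=lambda xi, mat: [y for y in preds.get(xi, []) if y in mat],
                   estimate_benefit=lambda xi, mat, pred: 5 if xi in preds else -1))
  # -> ({'A'}, {'B', 'C'})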

