New version page

Representing Tuple and Attribute Uncertainty in Probabilistic Databases

This preview shows page 1-2 out of 6 pages.

View Full Document
View Full Document

End of preview. Want to read all 6 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document
Unformatted text preview:

Representing Tuple and Attribute Uncertainty in Probabilistic DatabasesPrithviraj [email protected] [email protected] [email protected] Science Department, University of Maryland, College Park, MD, 20742.AbstractThere has been a recent surge in work in probabilisticdatabases, propelled in large part by the huge increase innoisy data sources — sensor data, experimental data, datafrom uncurated sources, and many others. There is a grow-ing need to be able to flexibly represent the uncertaintiesin the data, and to efficiently query the data. Building onexisting probabilistic database work, we present a unifyingframework which allows a flexible representation of corre-lated tuple and attribute level uncertainties. An importantcapability of our representation is the ability to representshared correlation structures in the data. We provide moti-vating examples to illustrate when such shared correlationstructures are likely to exist. Representing shared corre-lations structures allows the use of sophisticated inferencetechniques based on lifted probabilistic inference that, inturn, allows us to achieve significant speedups while com-puting probabilities for results of user-submitted queries.1 IntroductionAn increasing number of real-world applications are de-manding support for managing, storing, and querying un-certain data in relational database systems. This has led toa recent resurgence of research in the area of probabilis-tic databases. This work has spanned a wide range of is-sues such as query optimization [11], indexing uncertaindata [22], query languages [4, 2] and data models that canrepresent complex correlations [13, 20, 21] naturally pro-duced from various applications [14, 13, 3]. The underly-ing data models for much of this work are based on proba-bility theory coupled with possible worlds semantics wherea probabilistic database is defined as a probability distribu-tion over numerous possible databases. Although this def-inition results in intuitive querying semantics, probabilisticdatabases are still far from being efficient and scalable, withthe query processing problems known to be #P-completeeven for very simple queries [11].As opposed to query evaluation for traditional databasescontaining exact data, query evaluation for probabilisticdatabases requires that we calculate probabilities associatedwith result tuples present in the answer to user-submittedqueries [11], and this, in turn, requires probabilistic infer-ence [21]. The main reason for most of the complexityresults associated with probabilistic databases is a conse-quence of this close connection with probabilistic inference,which is known to be an NP-Hard problem, in general [9].However, in many cases, it is possible to leverage specialproperties of the uncertain data at hand to reduce the com-plexity of query evaluation. One such property is the pres-ence of shared correlation structures. Many applicationsproduce uncertain data with probability distributions andcorrelations copied many times over; this is because thedistributions and correlations typically come from generalstatistics that do not vary on a per-tuple basis. For instance,Andritsos et al. [1] report a customer relationship manage-ment application where the objective is to merge data fromtwo or more source databases and each source database isassigned a probability value based on the quality of the in-formation it contains. Even here, probabilities don’t changefrom tuple to tuple, since tuples from the same source areassigned the same source probability. Also, Dalvi and Su-ciu [12] report uncertain data collected as part of the CancerGenome Anatomy Project that assigns tags to genes and thisinformation is uncertain because of the inherent uncertain-ties in the experiments conducted to produce the tag-geneassociation. Once again, the uncertainties in the experimen-tal setup are likely to vary in systematic ways from one ex-periment to another, and will result in repetition in probabil-ity values. As we show in this paper, it is possible to exploitsuch shared correlation structures using sophisticated infer-ence techniques to significantly reduce the time associatedwith probabilistic inference for query evaluation.One prerequisite to being able to exploit shared correla-tion structures is to design a probabilistic database that canrepresent the shared correlation structures in the base datain the first place. Unfortunately, most of the recent workin probabilistic databases has been concentrated on the de-velopment of tuple-level uncertainty probabilistic databases[15, 11, 13, 21]. Many applications produce data that ismore naturally represented using attribute-level uncertaintyas opposed to tuple-level uncertainty, e.g., sensor network1?????MPG1/11/11/151/101/1Date$20,000CivicHybrid202105$20,000CivicHybrid202104$12,000Civic??103$4,000Civic (DX)Sedan201102$6,000Civic (EX)Sedan201101PriceModelTypeSellerIDAdID?????MPG1/11/11/151/101/1Date$20,000CivicHybrid202105$20,000CivicHybrid202104$12,000Civic??103$4,000Civic (DX)Sedan201102$6,000Civic (EX)Sedan201101PriceModelTypeSellerIDAdIDGood202Shady201ReputationSellerIDGood202Shady201ReputationSellerID0.42020.6201f(t103.SellerId)t103.SellerId0.428CivicSedan0.6355045373532302826MPG0.60.40.20.70.10.20.60.2f(Type,Model,MPG)CivicHybridCivic (DX)SedanCivic (EX)SedanModelType0.21050.21040.81030.61020.3101f(t.E)AdID0.7Hybrid0.3Sedanf(t103.Type)t103.Type0.1FalseTrue0.1TrueFalseFalseTruet101.E0.9False0.8Truef(t101.E, t102.E)t102.E(a) (b)Figure 1: (a) A simple car advertisement database (b) Factors describing the probabilities and correlations among tuples and attributes.datasets [14], mobile object databases [8] etc. (see Bar-bara et al [3] for more examples). Although convertinga database with attribute-level uncertainty to a databasewith tuple-level uncertainty is a fairly simple operation, thistransformation usually leads to a loss of shared correlationstructures (besides resulting in a database that requires stor-ing a significantly larger number of tuples).In general, we may require both tuple and attribute leveluncertainty to faithfully represent the shared correlationstructures present in real-world data and, for this purpose,in this paper, we propose a probabilistic database modelthat can represent correlated attribute and tuple level un-certainty, thus fully exploiting the representational powerof the probability theory. Our proposed model uses smallfunctions called


Loading Unlocking...
Login

Join to view Representing Tuple and Attribute Uncertainty in Probabilistic Databases and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Representing Tuple and Attribute Uncertainty in Probabilistic Databases and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?