Jian Li and Amol DeshpandeUniversity of Maryland, College Park, USAConsensus Answers for Queries over Probabilistic DatabasesProbabilistic Databases Motivation: Increasing amounts of uncertain data Sensor Networks; Information Networks Noisy input data; measurement errors; incomplete data Prevalent use of probabilistic modeling techniques Data Integration and Information Extraction Need to model reputation, trust, and data quality Increasing use of automated tools for schema mapping etc. … Probabilistic databases Annotate tuples with existence probabilities, and attribute values with probability distributions Propagate probabilities through query execution Interpretation according to the "possible worlds semantics"Prob DB: DAll Possible Worlds:pw1,pw2,.…..All Possible Answers:Answer(pw1),Answer(pw2),……Answer(D)Exponential Size !!combineshould be efficientQuerySemantics of Query ProcessingSemantics of Query ProcessingHow to Combine? Allow probabilistic answers. Return all possible tuples along with prob. [Dalvi, Suciu ’04] Return tuples with annotations [Green et al. ’06] What if we want a single deterministic answer? Probabilistic thresholding [Dalvi, Suciu ’04] Return all tuples s.t. t appears in the answer w.p. >=Threshold Sampling Top-k queries ? Many prior proposals for combining them U-top-k, U-rank-k [Soliman et al. ’07] Probabilistic Threshold (PT-k) [Hua et al. ’08] Global-top-k [Zhang et al. ’08] Expected Rank [Cormode et al. ’09] Parameterized Ranking Function (PRF) [Li et al. ’09]pw 1:t1t2t3…pw 2:t1t3t4…pw 3:t2t3t5…pw 4:t2t4t5…But, formal semantics are lacking.Semantics of Top-k Queries…Consensus Answers Think of each possible answer as a point in the space.Suppose d() is a distance metric between answers. Consensus Answers:A single deterministic answer• Mean Answers: is the set of feasible answers• Median Answers: is the set of possible answerswhere ¿pwis the answer for the possible world pwConsensus Answersanswer 1w.p. 0.1answer 2w.p. 0.3answer 3w.p. 0.15answer 4w.p. 0.2answer 5w.p. 0.05answer 6w.p. 0.2the meananswerCentroid / Center of Masspw 1:t1t2t7…pw 3:t1t3t4…pw 2:t1t2t4…pw 6:t3t6t7…pw 4:t2t3t4…pw 5:t3t4t5…ma:t1t4t5…the median answerRelated Work Rank Aggregation [Dwork et al. ’01], [Ailon ‘07] Original work in voting systems [Condorcet ‘1785] Goal: Combine rankings provided by different experts Consensus Clustering [Ailon et al. ’08] Goal: Aggregate a set of clusterings to minimize the disagreements Probabilistic Query Processing Dichotomy result: Conjunctive query evaluation is either PTIME or #P-Complete [Dalvi , Suciu ’04] Finding consensus answers a much harder problem (NP-hard even if there is a safe plan)Outline Problem Definition: Consensus Answers Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion Tuple-independence ModelThe existence of each tuple is independent of other tuples Block-independent Disjoint (BID) SchemeKey Attr 1 Prob1 500 0.51 950 0.32 20 0.32 30 0.23 150 0.23 200 0.8Tuples with the same key are mutually exclusive.Probabilistic Database ModelsProbabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence.V(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)And node:Xor nodes:0.5 0.3 0.3 0.2 0.2 0.8Possible Worlds Pr(3,150) 0.02(3,200) 0.08…….(1,500);(2,20);(3,150) 0.03(1,950);(2,20);(3,150) 0.018…….Probabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence.V(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)And node:Xor nodes:0.5 0.3 0.3 0.2 0.2 0.8Possible Worlds Pr(3,150) 0.02(3,200) 0.08…….(1,500);(2,20);(3,150) 0.03(1,950);(2,20);(3,150) 0.018…….(1-0.5-0.3)*(1-0.3-0.2)*0.2=0.02Probabilistic Database Models Probabilistic And/Xor Trees(1,20)V(2,50) (2,20) (3,35) (1,30) (3,25)And nodes:Xor node:0.50.30.2Possible Worlds Pr(1,20);(2,50) 0.5(2,20);(3,35) 0.3(1,30);(3,25) 0.2VVV• And/Xor trees can represent any finite set of possible worlds (not necessarily compact).Computing Probabilities on And/Xor TreesGenerating Function Method:q+p1F1(x,y,…)+p2F2(x,y,…)+p3F3(x,y,…)VF1(x,y,…)F2(x,y,…)F3(x,y,…)F1(x,y,…)F2(x,y,…)F3(x,y,…)VF1(x,y,…)F2(x,y,…)F3(x,y,…)p1p2p3q=1-p1-p2-p3And Node:Xor Node:Leaves:x y x zComputing Probabilities on And/Xor TreesGenerating Function Method:Root:F(x,y,…)=ij…cij…xiyj….THM: The coefficient cij…of the term xiyj…. = total prob of the possible worlds which containi tuples annotated with x,j tuples annotated with y,……Computing Probabilities on And/Xor TreesV(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)0.5 0.3 0.3 0.2 0.2 0.8xxx xx x0.2+0.8x 0.5+0.5x x(0.2+0.8x)(0.5+0.5x)x = 0.4 x3+0.5 x2+0.1 x Pr(|pw|=3)=0.4Pr(|pw|=2)=0.5Pr(|pw|=1)=0.1Example: Computing the prob. dist. of the size of the pwComputing Probabilities on And/Xor TreesExample: Computing the rank distributionr(i) : the rank of tuple i.r(i)=j if and only if (1) j-1 tuples with higher scores appear(2) tuple i appears(1,500) (2,950) (3,200)(4,350)(5,550)x1x y x0.5 0.8 0.4 0.60.9VVVVV V10.2+0.8x0.5+0.5x0.4+0.6y0.1+0.9x(0.5+0.5x)(0.2+0.8x)(0.4+0.6y)(0.1+0.9x)Pr(r(i)=j) = coeff of xj-1yOutline Problem Definition: Consensus Answers Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries ConclusionSet Distance Metrics Think of the relations (either existing or results of conjunctive queries) as sets. Symmetric Difference: THM: For conjunctive queries over tuple independent databases, finding the median answer under the symmetric difference distance is NP-Hard (even if the query has a safe plan).Reduction from MAX-2-SATTHM: The mean answer under the symmetric difference distance is the set of all tuples with probability >0.5.Set Distance Metrics Jaccard Distance LM: For tuple independent databases, if the mean world contains tuple t1but not tuple t2, then Pr (t1) > Pr (t2). Hence, suffices to sort by probabilities, and consider prefixes LM: For any fixed world W, E[dJ(W,pw)] can be computed in polynomial time (using generating functions) Gives us a polynomial time algorithmOutline Problem Definition: Consensus Answers
or
We will never post anything without your permission.
Don't have an account? Sign up