Consensus Answers for Queries over Probabilistic Databases

Home> Academic Documents> Consensus Answers for Queries over Probabilistic Databases

DOC PREVIEW

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Jian Li and Amol DeshpandeUniversity of Maryland, College Park, USAConsensus Answers for Queries over Probabilistic DatabasesProbabilistic Databases Motivation: Increasing amounts of uncertain data Sensor Networks; Information Networks Noisy input data; measurement errors; incomplete data Prevalent use of probabilistic modeling techniques Data Integration and Information Extraction Need to model reputation, trust, and data quality Increasing use of automated tools for schema mapping etc. … Probabilistic databases Annotate tuples with existence probabilities, and attribute values with probability distributions Propagate probabilities through query execution Interpretation according to the "possible worlds semantics"Prob DB: DAll Possible Worlds:pw1,pw2,.…..All Possible Answers:Answer(pw1),Answer(pw2),……Answer(D)Exponential Size !!combineshould be efficientQuerySemantics of Query ProcessingSemantics of Query ProcessingHow to Combine? Allow probabilistic answers. Return all possible tuples along with prob. [Dalvi, Suciu ’04]  Return tuples with annotations [Green et al. ’06] What if we want a single deterministic answer? Probabilistic thresholding [Dalvi, Suciu ’04]  Return all tuples s.t. t appears in the answer w.p. >=Threshold Sampling Top-k queries ? Many prior proposals for combining them U-top-k, U-rank-k [Soliman et al. ’07] Probabilistic Threshold (PT-k) [Hua et al. ’08] Global-top-k [Zhang et al. ’08] Expected Rank [Cormode et al. ’09] Parameterized Ranking Function (PRF) [Li et al. ’09]pw 1:t1t2t3…pw 2:t1t3t4…pw 3:t2t3t5…pw 4:t2t4t5…But, formal semantics are lacking.Semantics of Top-k Queries…Consensus Answers Think of each possible answer as a point in the space.Suppose d() is a distance metric between answers. Consensus Answers:A single deterministic answer• Mean Answers: is the set of feasible answers• Median Answers: is the set of possible answerswhere ¿pwis the answer for the possible world pwConsensus Answersanswer 1w.p. 0.1answer 2w.p. 0.3answer 3w.p. 0.15answer 4w.p. 0.2answer 5w.p. 0.05answer 6w.p. 0.2the meananswerCentroid / Center of Masspw 1:t1t2t7…pw 3:t1t3t4…pw 2:t1t2t4…pw 6:t3t6t7…pw 4:t2t3t4…pw 5:t3t4t5…ma:t1t4t5…the median answerRelated Work Rank Aggregation [Dwork et al. ’01], [Ailon ‘07]  Original work in voting systems [Condorcet ‘1785] Goal: Combine rankings provided by different experts Consensus Clustering [Ailon et al. ’08] Goal: Aggregate a set of clusterings to minimize the disagreements Probabilistic Query Processing  Dichotomy result: Conjunctive query evaluation is either PTIME or #P-Complete [Dalvi , Suciu ’04] Finding consensus answers a much harder problem (NP-hard even if there is a safe plan)Outline Problem Definition: Consensus Answers  Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion Tuple-independence ModelThe existence of each tuple is independent of other tuples Block-independent Disjoint (BID) SchemeKey Attr 1 Prob1 500 0.51 950 0.32 20 0.32 30 0.23 150 0.23 200 0.8Tuples with the same key are mutually exclusive.Probabilistic Database ModelsProbabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence.V(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)And node:Xor nodes:0.5 0.3 0.3 0.2 0.2 0.8Possible Worlds Pr(3,150) 0.02(3,200) 0.08…….(1,500);(2,20);(3,150) 0.03(1,950);(2,20);(3,150) 0.018…….Probabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence.V(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)And node:Xor nodes:0.5 0.3 0.3 0.2 0.2 0.8Possible Worlds Pr(3,150) 0.02(3,200) 0.08…….(1,500);(2,20);(3,150) 0.03(1,950);(2,20);(3,150) 0.018…….(1-0.5-0.3)*(1-0.3-0.2)*0.2=0.02Probabilistic Database Models Probabilistic And/Xor Trees(1,20)V(2,50) (2,20) (3,35) (1,30) (3,25)And nodes:Xor node:0.50.30.2Possible Worlds Pr(1,20);(2,50) 0.5(2,20);(3,35) 0.3(1,30);(3,25) 0.2VVV• And/Xor trees can represent any finite set of possible worlds (not necessarily compact).Computing Probabilities on And/Xor TreesGenerating Function Method:q+p1F1(x,y,…)+p2F2(x,y,…)+p3F3(x,y,…)VF1(x,y,…)F2(x,y,…)F3(x,y,…)F1(x,y,…)F2(x,y,…)F3(x,y,…)VF1(x,y,…)F2(x,y,…)F3(x,y,…)p1p2p3q=1-p1-p2-p3And Node:Xor Node:Leaves:x y x zComputing Probabilities on And/Xor TreesGenerating Function Method:Root:F(x,y,…)=ij…cij…xiyj….THM: The coefficient cij…of the term xiyj…. = total prob of the possible worlds which containi tuples annotated with x,j tuples annotated with y,……Computing Probabilities on And/Xor TreesV(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)0.5 0.3 0.3 0.2 0.2 0.8xxx xx x0.2+0.8x 0.5+0.5x x(0.2+0.8x)(0.5+0.5x)x = 0.4 x3+0.5 x2+0.1 x Pr(|pw|=3)=0.4Pr(|pw|=2)=0.5Pr(|pw|=1)=0.1Example: Computing the prob. dist. of the size of the pwComputing Probabilities on And/Xor TreesExample: Computing the rank distributionr(i) : the rank of tuple i.r(i)=j if and only if (1) j-1 tuples with higher scores appear(2) tuple i appears(1,500) (2,950) (3,200)(4,350)(5,550)x1x y x0.5 0.8 0.4 0.60.9VVVVV V10.2+0.8x0.5+0.5x0.4+0.6y0.1+0.9x(0.5+0.5x)(0.2+0.8x)(0.4+0.6y)(0.1+0.9x)Pr(r(i)=j) = coeff of xj-1yOutline Problem Definition: Consensus Answers  Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries ConclusionSet Distance Metrics Think of the relations (either existing or results of conjunctive queries) as sets. Symmetric Difference: THM: For conjunctive queries over tuple independent databases, finding the median answer under the symmetric difference distance is NP-Hard (even if the query has a safe plan).Reduction from MAX-2-SATTHM: The mean answer under the symmetric difference distance is the set of all tuples with probability >0.5.Set Distance Metrics Jaccard Distance LM: For tuple independent databases, if the mean world contains tuple t1but not tuple t2, then Pr (t1) > Pr (t2). Hence, suffices to sort by probabilities, and consider prefixes LM: For any fixed world W, E[dJ(W,pw)] can be computed in polynomial time (using generating functions) Gives us a polynomial time algorithmOutline Problem Definition: Consensus Answers 


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Please select your school