New version page

Consensus Answers for Queries over Probabilistic Databases

Upgrade to remove ads

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save
View Full Document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience

Upgrade to remove ads
Unformatted text preview:

Jian Li and Amol DeshpandeUniversity of Maryland, College Park, USAConsensus Answers for Queries over Probabilistic DatabasesProbabilistic Databases Motivation: Increasing amounts of uncertain data Sensor Networks; Information Networks Noisy input data; measurement errors; incomplete data Prevalent use of probabilistic modeling techniques Data Integration and Information Extraction Need to model reputation, trust, and data quality Increasing use of automated tools for schema mapping etc. … Probabilistic databases Annotate tuples with existence probabilities, and attribute values with probability distributions Propagate probabilities through query execution Interpretation according to the "possible worlds semantics"Prob DB: DAll Possible Worlds:pw1,pw2,.…..All Possible Answers:Answer(pw1),Answer(pw2),……Answer(D)Exponential Size !!combineshould be efficientQuerySemantics of Query ProcessingSemantics of Query ProcessingHow to Combine? Allow probabilistic answers. Return all possible tuples along with prob. [Dalvi, Suciu ’04]  Return tuples with annotations [Green et al. ’06] What if we want a single deterministic answer? Probabilistic thresholding [Dalvi, Suciu ’04]  Return all tuples s.t. t appears in the answer w.p. >=Threshold Sampling Top-k queries ? Many prior proposals for combining them U-top-k, U-rank-k [Soliman et al. ’07] Probabilistic Threshold (PT-k) [Hua et al. ’08] Global-top-k [Zhang et al. ’08] Expected Rank [Cormode et al. ’09] Parameterized Ranking Function (PRF) [Li et al. ’09]pw 1:t1t2t3…pw 2:t1t3t4…pw 3:t2t3t5…pw 4:t2t4t5…But, formal semantics are lacking.Semantics of Top-k Queries…Consensus Answers Think of each possible answer as a point in the space.Suppose d() is a distance metric between answers. Consensus Answers:A single deterministic answer• Mean Answers: is the set of feasible answers• Median Answers: is the set of possible answerswhere ¿pwis the answer for the possible world pwConsensus Answersanswer 1w.p. 0.1answer 2w.p. 0.3answer 3w.p. 0.15answer 4w.p. 0.2answer 5w.p. 0.05answer 6w.p. 0.2the meananswerCentroid / Center of Masspw 1:t1t2t7…pw 3:t1t3t4…pw 2:t1t2t4…pw 6:t3t6t7…pw 4:t2t3t4…pw 5:t3t4t5…ma:t1t4t5…the median answerRelated Work Rank Aggregation [Dwork et al. ’01], [Ailon ‘07]  Original work in voting systems [Condorcet ‘1785] Goal: Combine rankings provided by different experts Consensus Clustering [Ailon et al. ’08] Goal: Aggregate a set of clusterings to minimize the disagreements Probabilistic Query Processing  Dichotomy result: Conjunctive query evaluation is either PTIME or #P-Complete [Dalvi , Suciu ’04] Finding consensus answers a much harder problem (NP-hard even if there is a safe plan)Outline Problem Definition: Consensus Answers  Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion Tuple-independence ModelThe existence of each tuple is independent of other tuples Block-independent Disjoint (BID) SchemeKey Attr 1 Prob1 500 0.51 950 0.32 20 0.32 30 0.23 150 0.23 200 0.8Tuples with the same key are mutually exclusive.Probabilistic Database ModelsProbabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence.V(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)And node:Xor nodes:0.5 0.3 0.3 0.2 0.2 0.8Possible Worlds Pr(3,150) 0.02(3,200) 0.08…….(1,500);(2,20);(3,150) 0.03(1,950);(2,20);(3,150) 0.018…….Probabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence.V(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)And node:Xor nodes:0.5 0.3 0.3 0.2 0.2 0.8Possible Worlds Pr(3,150) 0.02(3,200) 0.08…….(1,500);(2,20);(3,150) 0.03(1,950);(2,20);(3,150) 0.018…….(1-0.5-0.3)*(1-0.3-0.2)*0.2=0.02Probabilistic Database Models Probabilistic And/Xor Trees(1,20)V(2,50) (2,20) (3,35) (1,30) (3,25)And nodes:Xor node:0.50.30.2Possible Worlds Pr(1,20);(2,50) 0.5(2,20);(3,35) 0.3(1,30);(3,25) 0.2VVV• And/Xor trees can represent any finite set of possible worlds (not necessarily compact).Computing Probabilities on And/Xor TreesGenerating Function Method:q+p1F1(x,y,…)+p2F2(x,y,…)+p3F3(x,y,…)VF1(x,y,…)F2(x,y,…)F3(x,y,…)F1(x,y,…)F2(x,y,…)F3(x,y,…)VF1(x,y,…)F2(x,y,…)F3(x,y,…)p1p2p3q=1-p1-p2-p3And Node:Xor Node:Leaves:x y x zComputing Probabilities on And/Xor TreesGenerating Function Method:Root:F(x,y,…)=ij…cij…xiyj….THM: The coefficient cij…of the term xiyj…. = total prob of the possible worlds which containi tuples annotated with x,j tuples annotated with y,……Computing Probabilities on And/Xor TreesV(1,500)VVV(1,950) (2,20) (2,30) (3,150) (3,200)0.5 0.3 0.3 0.2 0.2 0.8xxx xx x0.2+0.8x 0.5+0.5x x(0.2+0.8x)(0.5+0.5x)x = 0.4 x3+0.5 x2+0.1 x Pr(|pw|=3)=0.4Pr(|pw|=2)=0.5Pr(|pw|=1)=0.1Example: Computing the prob. dist. of the size of the pwComputing Probabilities on And/Xor TreesExample: Computing the rank distributionr(i) : the rank of tuple i.r(i)=j if and only if (1) j-1 tuples with higher scores appear(2) tuple i appears(1,500) (2,950) (3,200)(4,350)(5,550)x1x y x0.5 0.8 0.4 0.60.9VVVVV V10.2+0.8x0.5+0.5x0.4+0.6y0.1+0.9x(0.5+0.5x)(0.2+0.8x)(0.4+0.6y)(0.1+0.9x)Pr(r(i)=j) = coeff of xj-1yOutline Problem Definition: Consensus Answers  Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries ConclusionSet Distance Metrics Think of the relations (either existing or results of conjunctive queries) as sets. Symmetric Difference: THM: For conjunctive queries over tuple independent databases, finding the median answer under the symmetric difference distance is NP-Hard (even if the query has a safe plan).Reduction from MAX-2-SATTHM: The mean answer under the symmetric difference distance is the set of all tuples with probability >0.5.Set Distance Metrics Jaccard Distance LM: For tuple independent databases, if the mean world contains tuple t1but not tuple t2, then Pr (t1) > Pr (t2). Hence, suffices to sort by probabilities, and consider prefixes LM: For any fixed world W, E[dJ(W,pw)] can be computed in polynomial time (using generating functions) Gives us a polynomial time algorithmOutline Problem Definition: Consensus Answers 


Download Consensus Answers for Queries over Probabilistic Databases
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Consensus Answers for Queries over Probabilistic Databases and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Consensus Answers for Queries over Probabilistic Databases 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?