Lecture 17 Keyword Search on Relational Databases Oct 31 2007 ChengXiang Zhai Most slides are from Esha Palta and Kumar Gaurav Bijay s presentation CS511 Advanced Database Management Systems 1 Motivation Keyword search We have SQL why keyword querying SQL not appropriate for naive users So many online databases imdb citeseer bseindia user cannot keep track of schema for all of these CS511 Advanced Database Management Systems 2 Simple Approaches Using Form interfaces Require separate form for each type of query confusing Not suitable for ad hoc queries how many forms will you provide How about Google Export data from db to documents and do keywordquerying on these Suffers from duplication overheads Google wants all keywords in one document DB is often normalized so need to join tables and store as documents Multiple combinations of tables to join Not scalable CS511 Advanced Database Management Systems 3 Differences from Web Search Related data split across multiple tuples due to normalization Writes Paper Author Cites PaperId AuthorId AuthorId Citing PaperName AuthorName PaperId Cited The DBLP Bibliography Schema Different keywords may match tuples from different relations What joins are to be computed can only be decided on the fly Need to find result containing all keywords and rank them somehow CS511 Advanced Database Management Systems 4 Systems for DB search BANKS Browsing and Keyword Search IITB ICDE 02 DBXplorer Microsoft Research ICDE 02 ObjectRank IBM UCSD FIU VLDB 04 Bidirectional BANKS IITB VLDB 05 CS511 Advanced Database Management Systems 5 Systems for DB search BANKS Browsing and Keyword Search IITB ICDE 02 DBXplorer Microsoft Research ICDE 02 ObjectRank IBM UCSD FIU VLDB 04 Bidirectional BANKS IITB VLDB 05 CS511 Advanced Database Management Systems 6 BANKS ICDE 02 CS511 Advanced Database Management Systems 7 The BANKS system BANKS Architecture User HTTP BANKS JDBC Web server Available on the web http www cse iitb ac in banks Connects to database using JDBC Database JDBC metadata features used to provide schema browsing Preprocesses db CS511 Advanced Database Management Systems 8 Basic Model Database modeled as a graph Nodes tuples Edges references between tuples foreign key assume for this talk inclusion dependencies Edges are directed BANKS01 Keyword MO MultiQuery Search Optimizn PaperId PaperName paper AuthorID PaperId writes Charuta BANKS01 AuthorId Charuta S Sudarshan DBLP example CS511 Advanced Database Management Systems Prasan Roy author 9 The BANKS Answer Model Query set of search terms t1 t2 tn For each search term ti we find set of nodes Si matching ti Eg Query Sudarshan Roy t1 Sudarshan t2 Roy Answer rooted directed tree connecting nodes matching keywords Root node has special significance may be restricted to some relations E g relations representing entities not relationships May include intermediate nodes not in any S i Steiner Tree Multiple answers Ranking based on proximity prestige CS511 Advanced Database Management Systems 10 Answer Example Query sudarshan MultiQuery roy Paper Optimization Writes Author S Sudarshan Writes Author Prasan Roy We would like to find sets of closely connected tuples that match all given keywords CS511 Advanced Database Management Systems 11 Edge Directionality Directed tree will miss desired answers For eg Query DBXplorer ObjectRank CitedBy Cites BANKS Cites Cited DBXPlorer ObjectRan k Citedforward edge CitedByBANKS adds a back for each So BANKS Cites DBXPlorer edge Cites CS511 Advanced Database Management Systems ObjectRank 12 Edge Directionality What if we ignore directionality Some popular tuples are connected to many other tuples E g Students departments university Problem A popular tuple would create misleading shortcuts between tuples E g every student would be closely linked with every other student via the department university Solution define different forward and backward edge weights Forward edges In the direction of the foreign key reference CS511 Advanced Database Management Systems 13 Edge Weight Weight of forward edge based on schema e g citation link weights writes link weights Weight of backward edge indegree of edges pointing to the node 3 3 1 1 3 1 CS511 Advanced Database Management Systems 14 Edge Weight Scaling Normalize edge score Escore e Problem Some backward edges have unduly large weights Make edge weight scale free by dividing edge weigth by wmin Depress the scale by defining Escore e as log 1 w e wmin Overall Escore E 1 1 Escore e CS511 Advanced Database Management Systems 15 Node Weight Set weight of a Node Indegree of the node As per prestige rankings nodes with multiple pointers to them get a higher prestige So higher node weight corresponds to higher prestige Problem Nodes with many in edges result in skewed answers Subdue extreme node weights by using log 1 indegree Node score Nscore Average of node scores rootnode weight leaf node weights CS511 Advanced Database Management Systems 16 Combining Scores Combining two independent metrics node weight and edge weight Normalize each to 0 1 Combine using weighting factor Additive 1 Escore Nscore Multiplicative Escore Nscore Performance study to compare alternatives and to find reasonable values for CS511 Advanced Database Management Systems 17 First Step Symbol Table The first step is to build a symbol table This table is in the db and is not normalized Example Keyword List of Matching Nodes Database NICDE 2 NVLDB 3 Search NBANKS1 NBANKS2 NDBXPLR Rank NOBJRNK NXRANK NSPHSRCH CS511 Advanced Database Management Systems 18 Searching for Best Answers Backward Expanding Search Algorithm Assume graph fits in memory Idea find vertices from which a forward path exists to at least one node from each Si Run concurrent single source shortest path algorithm from each node matching a keyword Create an iterator for each node matching a keyword Traverse the graph edges in reverse direction Output a node whenever it is on the intersection of the sets of nodes reached from each keyword Answer trees may not be generated in relevance order CS511 Advanced Database Management Systems 19 Backward Expanding Search Query sudarshan roy MultiQuery Optimization paper writes authors Prasan Roy S Sudarshan CS511 Advanced Database Management Systems Iterators 20 BANKS Query Result Example Result of Sudarshan Roy CS511 Advanced Database Management Systems 21 Result Ordering Answers need not be always in Relevance order This tree is output Better Root Missed 5 CS511 Advanced
View Full Document