1Web Searching & IndexingCPS 116Introduction to Database Systems2Announcements (December 2) Homework #4 sample solution available Course project demo period: December 8-13 Each project gets a 30-minute slot with me Email me to schedule demo slots Final exam next Saturday, Dec. 13, 7-10pm Early final next Wednesday, Dec. 10, 9am-12pm Again, open book, open notes Focus on the second half of the course Sample final available Sample final solution available Thursday3Keyword searchGoogle…Web | Images | Groups| DirectoryGoogle Search | I’mFeeling LuckyAdvanced Search |Preferences | LanguageAssociation forComputing MachineryFounded in 1947,ACM is the world’sfirst educational andscientific computingsociety Today ourCPS 216: AdvancedDatabase Systems(Fall 2008)Course InformationCourse Description /Ti d Pl /The Internet MovieDatabase (IMDb)…… Search the InternetMovie Database. ForPreferences | LanguageTools…society. To d a y, ourmembers—…Time and Place /BooksResources: Staff…more search options,please visit Searchcentral…database AND search SearchWhat are the documents containing both “database” and “search”?24Keywords × documentsAll documents“a”“cat”All keywords111…1110…0 Inverted lists: store the matrix by rows Signature files: store the matrix by columns“database”“dog”“search”1 means keyword appears in the document;0 means otherwise001…0010…1001…0……………5Inverted lists Store the matrix by rows For each keyword, store an inverted list hkeyword, doc-id-listi h“database”, {3, 7, 142, 857, …}ih“ h” {3 9 192 512 }ih“search”, {3, 9, 192, 512, …}i It helps to sort doc-id-list (why?) Vocabulary index on keywords B+-tree or hash-based How large is an inverted list index?6Using inverted lists Documents containing “database” Use the vocabulary index to find the inverted list for “database” Return documents in the inverted list Documents containing “database” AND “search” OR? NOT?37What are “all” the keywords? All sequences of letters (up to a given length)? … that actually appear in documents! All words in English? Plus all phrases?Al i i h h b i iAlternative: approximate phrase search by proximity Minus all stop words They appear in nearly every document, e.g., a, of, the, it Not useful in search Combine words with common stems Example: database, databases They can be treated as the same for the purpose of search8Frequency and proximity Frequency hkeyword, { hdoc-id, number-of-occurrencesi,hdoc-id, number-of-occurrencesi,… }iii ( df ) Proximity (and frequency) hkeyword, { hdoc-id, hposition-of-occurrence1,position-of-occurrence2, …i,hdoc-id, hposition-of-occurrnece1, …ii,… }i When doing AND, check for positions that are near9Signature files Store the matrix by columns and compress them For each document, store a w-bit signature Each word is hashed into a w-bit value, with onlys < w bits turned on Signature is computed by taking the bit-wise OR of the hash values of all words on the document) Some false positives; no false negativeshash(“database”) = 0110hash(“dog”) = 1100hash(“cat”) = 0010doc1contains “database”: 0110doc2contains “dog”: 1100doc3contains “cat” and “dog”: 1110Does doc3contain“database”?410Bit-sliced signature files Motivation To check if a document contains a word, we only need to check the bits that are set in the word’s hash valueSo why bother retrieving all wbits doc Signature1 0 0 0 0 1 0 0 02 0 0 0 0 1 0 0 03 0 0 0 1 1 0 1 04 0 1 1 0 1 1 0 0So why bother retrieving all wbits of the signature? Instead of storing n signature files, store w bit slices Only check the slices that correspond to the set bits in the word’s hash value Start from the sparse slices……N0 0 0 0 1 0 1 0Bit-sliced signature filesSlice 0Slice 7…Starting to look likean inverted list again!11Inverted lists versus signatures Inverted lists better for most purposes (TODS, 1998) Problems of signature files False positives Hard to use because s, w, and the hash function need tuning to work well Long documents will likely have mostly 1’s in signatures Common words will create mostly 1’s for their slices Difficult to extend with features such as frequency, proximity Saving grace of signature files Sizes are tunable Good for lots of search terms Good for computing similarity of documents12Ranking result pages A single search may return many pages A user will not look at all result pages Complete result may be unnecessary)Result pages need to be rankedPossible ranking criteriaPossible ranking criteria Based on content• Number of occurrences of the search terms• Similarity to the query text Based on link structure• Backlink count• PageRank And more…513Textual similarity Vocabulary: [w1, …, wn] IDF (Inverse Document Frequency): [f1, …, fn] fi= 1 / the number of times wiappears on the Web Significance of words on page p: [p1f1, …, pnfn]gpgp[p1f1,,pnfn] piis the number of times wiappears on p Textual similarity between two pages p and q is defined to be [p1f1, …, pnfn] · [q1f1, …, qnfn] =p1q1f12+ … + pnqnfn2 q could be the query text14Why weight significance by IDF?15Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine” Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” repeated many times616Backlink A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page17Google’s PageRank Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spamNaïve PageRankNaïve PageRank N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) = Σq∈B(p)(PageRank(q) ⁄ N(q)))Each page p gets a boost of its
View Full Document