1Web Searching & IndexingCPS 116Introduction to Database Systems2Announcements (December 6) Homework #4 due on today (will be graded by this weekend) Course project demo Final exam on Tuesday, Dec. 13, 7-10pm Again, open book, open notes Focus on the second half of the course3Keyword searchGoogle…Web | Images | Groups| DirectoryGoogle Search | I’mFeeling LuckyAdvanced Search |Preferences | LanguageTools…Association forComputing MachineryFounded in 1947,ACM is the world’sfirst educational andscientific computingsociety. Today, ourmembers—…CPS 216: AdvancedDatabase Systems(Fall 2001)Course InformationCourse Description /Time and Place /BooksResources: Staff…The Internet MovieDatabase (IMDb)…… Search the InternetMovie Database. Formore search options,please visit Searchcentral…database AND search SearchWhat are the documents containing both “database” and “search”?24Keywords × documents Inverted lists: store the matrix by rows Signature files: store the matrix by columns111…1110…0001…0010…1001…0……………Document 1Document 2Document 3Document nAll documents“a”“database”“cat”“dog”“search”All keywords1 means keyword appears in the document0 means otherwise5Inverted lists Store the matrix by rows For each keyword, store an inverted list hkeyword, doc-id-listi h“database”, {3, 7, 142, 857, …}i h“search”, {3, 9, 192, 512, …}i It helps to sort doc-id-list (why?) Vocabulary index on keywords B+-tree or hash-based How large is an inverted list index?6Using inverted lists Documents containing “database” Use the vocabulary index to find the inverted list for “database” Return documents in the inverted list Documents containing “database” AND “search” OR? NOT?37What are “all” the keywords? All sequences of letters (up to a given length)? … that actually appear in documents! All words in English? Plus all phrases? Alternative: approximate phrase search by proximity Minus all stop words They appear in nearly every document, e.g., a, of, the, it Not useful in search Combine words with common stems Example: database, databases They can be treated as the same for the purpose of search8Frequency and proximity Frequency hkeyword, { hdoc-id, number-of-occurrencesi,hdoc-id, number-of-occurrencesi,… }i Proximity (and frequency) hkeyword, { hdoc-id, hposition-of-occurrence1,position-of-occurrence2, …i,hdoc-id, hposition-of-occurrnece1, …ii,… }i When doing AND, check for positions that are near9Signature files Store the matrix by columns and compress them For each document, store a w-bit signature Each word is hashed into a w-bit value, with only s< w bits turned on Signature is computed by taking the bit-wise OR of the hash values of all words on the document) Some false positives; no false negativeshash(“database”) = 0110hash(“dog”) = 1100hash(“cat”) = 0010doc1contains “database”: 0110doc2contains “dog”: 1100doc3contains “cat” and “dog”: 1110Does doc3contain“database”?410Bit-sliced signature files Motivation To check if a document contains a word, we only need to check the bits that are set in the word’s hash value So why bother retrieving all w bits of the signature? Instead of storing n signature files, store w bit slices Only check the slices that correspond to the set bits in the word’s hash value Start from the sparse slicesdoc signature1 0 0 0 0 1 0 0 02 0 0 0 0 1 0 0 03 0 0 0 1 1 0 1 04 0 1 1 0 1 1 0 0……n0 0 0 0 1 0 1 0Bit-sliced signature filesSlice 0Slice 7 …Starting to look likean inverted list again!11Inverted lists versus signatures Inverted lists better for most purposes (TODS, 1998) Problems of signature files False positives Hard to use because s, w, and the hash function need tuning to work well Long documents will likely have mostly 1’s in signatures Common words will create mostly 1’s for their slices Difficult to extend with features such as frequency, proximity Saving grace of signature files12Ranking result pages A single search may return many pages A user will not look at all result pages Complete result may be unnecessary)Result pages need to be ranked Possible ranking criteria Based on content• Number of occurrences of the search terms• Similarity to the query text Based on link structure•Backlinkcount• PageRank And more…513Textual similarity Vocabulary: [w1, …, wn] IDF (Inverse Document Frequency): [f1, …, fn] fi= 1 / the number of times wiappears on the Web Significance of words on page p: [p1f1, …, pnfn] piis the number of times wiappears on p Textual similarity between two pages p and q is defined to be [p1f1, …, pnfn] · [q1f1, …, qnfn] =p1q1f12+ … + pnqnfn2 q could be the query text14Why weight significance by IDF? Without IDF weighting, the similarity measure would be dominated by the stop words “the” occurs frequently on the Web, so its occurrence on a particular page should be considered less significant “engine” occurs infrequently on the Web, so its occurrence on a particular page should be considered more significant15Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine” Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” repeated many times616Backlink A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page17Google’s PageRank Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spam Naïve PageRank N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) = Σq∈B(p)(PageRank(q) ⁄ N(q)))Each page p gets a boost of its importance from each page that points to p)Each page q evenly distributes its importance to all pages that qpoints to18Calculating naïve
View Full Document