DOC PREVIEW
Duke CPS 116 - Web Searching & Indexing

This preview shows page 1-2-3 out of 9 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1Web Searching & IndexingCPS 116Introduction to Database Systems2Announcements (December 2) Homework #4 sample solution available Course project demo period: December 8-13 Each project gets a 30-minute slot with me Email me to schedule demo slots Final exam next Saturday, Dec. 13, 7-10pm Early final next Wednesday, Dec. 10, 9am-12pm Again, open book, open notes Focus on the second half of the course Sample final available Sample final solution available Thursday3Keyword searchGoogle…Web | Images | Groups| DirectoryGoogle Search | I’mFeeling LuckyAdvanced Search |Preferences | LanguageAssociation forComputing MachineryFounded in 1947,ACM is the world’sfirst educational andscientific computingsociety Today ourCPS 216: AdvancedDatabase Systems(Fall 2008)Course InformationCourse Description /Ti d Pl /The Internet MovieDatabase (IMDb)…… Search the InternetMovie Database. ForPreferences | LanguageTools…society. To d a y, ourmembers—…Time and Place /BooksResources: Staff…more search options,please visit Searchcentral…database AND search SearchWhat are the documents containing both “database” and “search”?24Keywords × documentsAll documents“a”“cat”All keywords111…1110…0 Inverted lists: store the matrix by rows Signature files: store the matrix by columns“database”“dog”“search”1 means keyword appears in the document;0 means otherwise001…0010…1001…0……………5Inverted lists Store the matrix by rows For each keyword, store an inverted list hkeyword, doc-id-listi h“database”, {3, 7, 142, 857, …}ih“ h” {3 9 192 512 }ih“search”, {3, 9, 192, 512, …}i It helps to sort doc-id-list (why?) Vocabulary index on keywords B+-tree or hash-based How large is an inverted list index?6Using inverted lists Documents containing “database” Use the vocabulary index to find the inverted list for “database” Return documents in the inverted list Documents containing “database” AND “search” OR? NOT?37What are “all” the keywords? All sequences of letters (up to a given length)? … that actually appear in documents! All words in English? Plus all phrases?Al i i h h b i iAlternative: approximate phrase search by proximity Minus all stop words They appear in nearly every document, e.g., a, of, the, it Not useful in search Combine words with common stems Example: database, databases They can be treated as the same for the purpose of search8Frequency and proximity Frequency hkeyword, { hdoc-id, number-of-occurrencesi,hdoc-id, number-of-occurrencesi,… }iii ( df ) Proximity (and frequency) hkeyword, { hdoc-id, hposition-of-occurrence1,position-of-occurrence2, …i,hdoc-id, hposition-of-occurrnece1, …ii,… }i When doing AND, check for positions that are near9Signature files Store the matrix by columns and compress them For each document, store a w-bit signature Each word is hashed into a w-bit value, with onlys < w bits turned on Signature is computed by taking the bit-wise OR of the hash values of all words on the document) Some false positives; no false negativeshash(“database”) = 0110hash(“dog”) = 1100hash(“cat”) = 0010doc1contains “database”: 0110doc2contains “dog”: 1100doc3contains “cat” and “dog”: 1110Does doc3contain“database”?410Bit-sliced signature files Motivation To check if a document contains a word, we only need to check the bits that are set in the word’s hash valueSo why bother retrieving all wbits doc Signature1 0 0 0 0 1 0 0 02 0 0 0 0 1 0 0 03 0 0 0 1 1 0 1 04 0 1 1 0 1 1 0 0So why bother retrieving all wbits of the signature? Instead of storing n signature files, store w bit slices Only check the slices that correspond to the set bits in the word’s hash value Start from the sparse slices……N0 0 0 0 1 0 1 0Bit-sliced signature filesSlice 0Slice 7…Starting to look likean inverted list again!11Inverted lists versus signatures Inverted lists better for most purposes (TODS, 1998) Problems of signature files False positives Hard to use because s, w, and the hash function need tuning to work well Long documents will likely have mostly 1’s in signatures Common words will create mostly 1’s for their slices Difficult to extend with features such as frequency, proximity Saving grace of signature files Sizes are tunable Good for lots of search terms Good for computing similarity of documents12Ranking result pages A single search may return many pages A user will not look at all result pages Complete result may be unnecessary)Result pages need to be rankedPossible ranking criteriaPossible ranking criteria Based on content• Number of occurrences of the search terms• Similarity to the query text  Based on link structure• Backlink count• PageRank And more…513Textual similarity Vocabulary: [w1, …, wn] IDF (Inverse Document Frequency): [f1, …, fn] fi= 1 / the number of times wiappears on the Web Significance of words on page p: [p1f1, …, pnfn]gpgp[p1f1,,pnfn] piis the number of times wiappears on p Textual similarity between two pages p and q is defined to be [p1f1, …, pnfn] · [q1f1, …, qnfn] =p1q1f12+ … + pnqnfn2 q could be the query text14Why weight significance by IDF?15Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine” Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” repeated many times616Backlink A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page17Google’s PageRank Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spamNaïve PageRankNaïve PageRank N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) = Σq∈B(p)(PageRank(q) ⁄ N(q)))Each page p gets a boost of its


View Full Document

Duke CPS 116 - Web Searching & Indexing

Documents in this Course
Part I

Part I

8 pages

XSLT

XSLT

4 pages

XSLT

XSLT

8 pages

Part I

Part I

8 pages

XSLT

XSLT

8 pages

Load more
Download Web Searching & Indexing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Web Searching & Indexing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Web Searching & Indexing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?