New version page

Duke CPS 116 - Web Searching & Indexing

Documents in this Course
Part I

Part I

8 pages

XSLT

XSLT

4 pages

XSLT

XSLT

8 pages

Part I

Part I

8 pages

XSLT

XSLT

8 pages

Load more
Upgrade to remove ads

This preview shows page 1-2 out of 5 pages.

Save
View Full Document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience

Upgrade to remove ads
Unformatted text preview:

1Web Searching & IndexingCPS 116Introduction to Database Systems2Announcements (December 6) Homework #4 due on today (will be graded by this weekend) Course project demo Final exam on Tuesday, Dec. 13, 7-10pm Again, open book, open notes Focus on the second half of the course3Keyword searchGoogle…Web | Images | Groups| DirectoryGoogle Search | I’mFeeling LuckyAdvanced Search |Preferences | LanguageTools…Association forComputing MachineryFounded in 1947,ACM is the world’sfirst educational andscientific computingsociety. Today, ourmembers—…CPS 216: AdvancedDatabase Systems(Fall 2001)Course InformationCourse Description /Time and Place /BooksResources: Staff…The Internet MovieDatabase (IMDb)…… Search the InternetMovie Database. Formore search options,please visit Searchcentral…database AND search SearchWhat are the documents containing both “database” and “search”?4Keywords × documents Inverted lists: store the matrix by rows Signature files: store the matrix by columns111…1110…0001…0010…1001…0……………Document 1Document 2Document 3Document nAll documents“a”“database”“cat”“dog”“search”All keywords1 means keyword appears in the document0 means otherwise5Inverted lists Store the matrix by rows For each keyword, store an inverted list hkeyword, doc-id-listi h“database”, {3, 7, 142, 857, …}i h“search”, {3, 9, 192, 512, …}i It helps to sort doc-id-list (why?) Vocabulary index on keywords B+-tree or hash-based How large is an inverted list index?6Using inverted lists Documents containing “database” Use the vocabulary index to find the inverted list for “database” Return documents in the inverted list Documents containing “database” AND “search” Return documents in the intersection of the two inverted lists OR? NOT? Union and difference, respectively27What are “all” the keywords? All sequences of letters (up to a given length)? … that actually appear in documents! All words in English? Plus all phrases? Alternative: approximate phrase search by proximity Minus all stop words They appear in nearly every document, e.g., a, of, the, it Not useful in search Combine words with common stems Example: database, databases They can be treated as the same for the purpose of search8Frequency and proximity Frequency hkeyword, { hdoc-id, number-of-occurrencesi,hdoc-id, number-of-occurrencesi,… }i Proximity (and frequency) hkeyword, { hdoc-id, hposition-of-occurrence1,position-of-occurrence2, …i,hdoc-id, hposition-of-occurrnece1, …ii,… }i When doing AND, check for positions that are near9Signature files Store the matrix by columns and compress them For each document, store a w-bit signature Each word is hashed into a w-bit value, with only s< w bits turned on Signature is computed by taking the bit-wise OR of the hash values of all words on the document) Some false positives; no false negativeshash(“database”) = 0110hash(“dog”) = 1100hash(“cat”) = 0010doc1contains “database”: 0110doc2contains “dog”: 1100doc3contains “cat” and “dog”: 1110Does doc3contain“database”?10Bit-sliced signature files Motivation To check if a document contains a word, we only need to check the bits that are set in the word’s hash value So why bother retrieving all w bits of the signature? Instead of storing n signature files, store w bit slices Only check the slices that correspond to the set bits in the word’s hash value Start from the sparse slicesdoc signature1 0 0 0 0 1 0 0 02 0 0 0 0 1 0 0 03 0 0 0 1 1 0 1 04 0 1 1 0 1 1 0 0……n0 0 0 0 1 0 1 0Bit-sliced signature filesSlice 0Slice 7 …Starting to look likean inverted list again!11Inverted lists versus signatures Inverted lists better for most purposes (TODS, 1998) Problems of signature files False positives Hard to use because s, w, and the hash function need tuning to work well Long documents will likely have mostly 1’s in signatures Common words will create mostly 1’s for their slices Difficult to extend with features such as frequency, proximity Saving grace of signature files Sizes are tunable Good for lots of search terms Good for computing similarity of documents12Ranking result pages A single search may return many pages A user will not look at all result pages Complete result may be unnecessary)Result pages need to be ranked Possible ranking criteria Based on content• Number of occurrences of the search terms• Similarity to the query text  Based on link structure•Backlinkcount• PageRank And more…313Textual similarity Vocabulary: [w1, …, wn] IDF (Inverse Document Frequency): [f1, …, fn] fi= 1 / the number of times wiappears on the Web Significance of words on page p: [p1f1, …, pnfn] piis the number of times wiappears on p Textual similarity between two pages p and q is defined to be [p1f1, …, pnfn] · [q1f1, …, qnfn] =p1q1f12+ … + pnqnfn2 q could be the query text14Why weight significance by IDF? Without IDF weighting, the similarity measure would be dominated by the stop words “the” occurs frequently on the Web, so its occurrence on a particular page should be considered less significant “engine” occurs infrequently on the Web, so its occurrence on a particular page should be considered more significant15Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine” Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” repeated many times16Backlink A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page17Google’s PageRank Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spam Naïve PageRank N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) =


View Full Document
Download Web Searching & Indexing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Web Searching & Indexing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Web Searching & Indexing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?