Duke CPS 116 - Web Searching & Indexing - D1182642

Home> Schools> Duke University> (CPS) > CPS 116> Web Searching & Indexing

DOC PREVIEW

Duke CPS 116 - Web Searching & Indexing

School name Duke University

Course Cps 116- Introduction to Database Systems

Pages 5

This preview shows page 1-2 out of 5 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 5 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 5 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 5 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Web Searching & IndexingCPS 116Introduction to Database Systems2Announcements (December 6) Homework #4 due on today (will be graded by this weekend) Course project demo Final exam on Tuesday, Dec. 13, 7-10pm Again, open book, open notes Focus on the second half of the course3Keyword searchGoogle…Web | Images | Groups| DirectoryGoogle Search | I’mFeeling LuckyAdvanced Search |Preferences | LanguageTools…Association forComputing MachineryFounded in 1947,ACM is the world’sfirst educational andscientific computingsociety. Today, ourmembers—…CPS 216: AdvancedDatabase Systems(Fall 2001)Course InformationCourse Description /Time and Place /BooksResources: Staff…The Internet MovieDatabase (IMDb)…… Search the InternetMovie Database. Formore search options,please visit Searchcentral…database AND search SearchWhat are the documents containing both “database” and “search”?4Keywords × documents Inverted lists: store the matrix by rows Signature files: store the matrix by columns111…1110…0001…0010…1001…0……………Document 1Document 2Document 3Document nAll documents“a”“database”“cat”“dog”“search”All keywords1 means keyword appears in the document0 means otherwise5Inverted lists Store the matrix by rows For each keyword, store an inverted list hkeyword, doc-id-listi h“database”, {3, 7, 142, 857, …}i h“search”, {3, 9, 192, 512, …}i It helps to sort doc-id-list (why?) Vocabulary index on keywords B+-tree or hash-based How large is an inverted list index?6Using inverted lists Documents containing “database” Use the vocabulary index to find the inverted list for “database” Return documents in the inverted list Documents containing “database” AND “search” Return documents in the intersection of the two inverted lists OR? NOT? Union and difference, respectively27What are “all” the keywords? All sequences of letters (up to a given length)? … that actually appear in documents! All words in English? Plus all phrases? Alternative: approximate phrase search by proximity Minus all stop words They appear in nearly every document, e.g., a, of, the, it Not useful in search Combine words with common stems Example: database, databases They can be treated as the same for the purpose of search8Frequency and proximity Frequency hkeyword, { hdoc-id, number-of-occurrencesi,hdoc-id, number-of-occurrencesi,… }i Proximity (and frequency) hkeyword, { hdoc-id, hposition-of-occurrence1,position-of-occurrence2, …i,hdoc-id, hposition-of-occurrnece1, …ii,… }i When doing AND, check for positions that are near9Signature files Store the matrix by columns and compress them For each document, store a w-bit signature Each word is hashed into a w-bit value, with only s< w bits turned on Signature is computed by taking the bit-wise OR of the hash values of all words on the document) Some false positives; no false negativeshash(“database”) = 0110hash(“dog”) = 1100hash(“cat”) = 0010doc1contains “database”: 0110doc2contains “dog”: 1100doc3contains “cat” and “dog”: 1110Does doc3contain“database”?10Bit-sliced signature files Motivation To check if a document contains a word, we only need to check the bits that are set in the word’s hash value So why bother retrieving all w bits of the signature? Instead of storing n signature files, store w bit slices Only check the slices that correspond to the set bits in the word’s hash value Start from the sparse slicesdoc signature1 0 0 0 0 1 0 0 02 0 0 0 0 1 0 0 03 0 0 0 1 1 0 1 04 0 1 1 0 1 1 0 0……n0 0 0 0 1 0 1 0Bit-sliced signature filesSlice 0Slice 7 …Starting to look likean inverted list again!11Inverted lists versus signatures Inverted lists better for most purposes (TODS, 1998) Problems of signature files False positives Hard to use because s, w, and the hash function need tuning to work well Long documents will likely have mostly 1’s in signatures Common words will create mostly 1’s for their slices Difficult to extend with features such as frequency, proximity Saving grace of signature files Sizes are tunable Good for lots of search terms Good for computing similarity of documents12Ranking result pages A single search may return many pages A user will not look at all result pages Complete result may be unnecessary)Result pages need to be ranked Possible ranking criteria Based on content• Number of occurrences of the search terms• Similarity to the query text  Based on link structure•Backlinkcount• PageRank And more…313Textual similarity Vocabulary: [w1, …, wn] IDF (Inverse Document Frequency): [f1, …, fn] fi= 1 / the number of times wiappears on the Web Significance of words on page p: [p1f1, …, pnfn] piis the number of times wiappears on p Textual similarity between two pages p and q is defined to be [p1f1, …, pnfn] · [q1f1, …, qnfn] =p1q1f12+ … + pnqnfn2 q could be the query text14Why weight significance by IDF? Without IDF weighting, the similarity measure would be dominated by the stop words “the” occurs frequently on the Web, so its occurrence on a particular page should be considered less significant “engine” occurs infrequently on the Web, so its occurrence on a particular page should be considered more significant15Problems with content-based ranking Many pages containing search terms may be of poor quality or irrelevant Example: a page with just a line “search engine” Many high-quality or relevant pages do not even contain the search terms Example: Google homepage Page containing more occurrences of the search terms are ranked higher; spamming is easy Example: a page with line “search engine” repeated many times16Backlink A page with more backlinks is ranked higher Intuition: Each backlink is a “vote” for the page’s importance Based on local link structure; still easy to spam Create lots of pages that point to a particular page17Google’s PageRank Main idea: Pages pointed by high-ranking pages are ranked higher Definition is recursive by design Based on global link structure; hard to spam Naïve PageRank N(p): number of outgoing links from page p B(p): set of pages that point to p PageRank(p) =

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 5 pages.

Duke CPS 116 - Web Searching & Indexing

Sign up for free to view:

Please select your school