Berkeley COMPSCI 186 - Text/Web Search II - Ranking & Crawling

Review: Simple Relational Text Index

- Create and populate a table:
      InvertedFile(term string, docID string)
- Build a B+-tree or hash index on InvertedFile.term
  - Use something like an Alternative 3 index
  - Keep the lists at the bottom sorted by docID
  - Typically called a "postings list"
- Example postings list: Berkeley -> 42, 49, 57

Boolean Search in SQL

Query: "Berkeley Database Research"

    SELECT IB.docID
      FROM InvertedFile IB, InvertedFile ID, InvertedFile IR
     WHERE IB.docID = ID.docID
       AND ID.docID = IR.docID
       AND IB.term = 'Berkeley'
       AND ID.term = 'Database'
       AND IR.term = 'Research'
     ORDER BY magic_rank();

- This time we wrote it as a join; last time we wrote it as an INTERSECT.
- Recall our query plan:
  - an index scan on each I*.term instance in the FROM clause
  - a merge join of the 3 index scans, ordered by docID
- magic_rank() is the "secret sauce" in the search engines; it will require rewriting this query somewhat.

Classical IR Ranking

- Abstraction: the vector space model. We'll think of every document as a vector.
  - Imagine there are 10,000 possible terms.
  - Each document (bag of words) can be represented as an array of 10,000 counts.
  - This array can be thought of as a point in 10,000-dimensional space.
- Measure the "distance" between two vectors: the similarity of two documents.
- A query is just a short document: rank all docs by their distance to the query "document"!

Classical IR Ranking: TF x IDF

- What's the right distance metric?
- Problem 1: two long docs seem more similar to each other than to short docs.
  - Solution: normalize each dimension by the vector's Euclidean length. Now every doc is a point on the unit sphere (to see this in 2D, rotate so one vector is (1, 0)).
  - BTW: for normalized vectors, cosine ranking is the same as ranking by Euclidean distance. The dot product (sum of products) of two normalized vectors happens to be the cosine of the angle between them: dj . dk = |dj| |dk| cos(theta).
- Problem 2: counting occurrences isn't a good way to weight each term.
  - Want to favor terms repeated in this doc.
  - Want to favor words that are unusual across docs.
- TF x IDF (Term Frequency x Inverse Doc Frequency). For each doc d and term t:
      DocTermRank = (#occurrences of t in d) * log(total #docs / #docs with this term)
- What is the IDF of a term that occurs in all of the docs? In almost no docs?

TF x IDF, cont.

- Instead of using counts in the vector, use DocTermRank.
- Let's add some more to our schema:
      TermInfo(term string, numDocs int)                        -- used to compute IDF
      InvertedFile(term string, docID int64, DocTermRank float)
- TermInfo is a materialized view on the InvertedFile table. What's the SQL for the view?
- Why not just store TF rather than DocTermRank?

Ranking (Simple Boolean Search)

    CREATE VIEW BooleanResult AS
    SELECT IB.docID,
           IB.DocTermRank AS bTFIDF,
           ID.DocTermRank AS dTFIDF,
           IR.DocTermRank AS rTFIDF
      FROM InvertedFile IB, InvertedFile ID, InvertedFile IR
     WHERE IB.docID = ID.docID
       AND ID.docID = IR.docID
       AND IB.term = 'Berkeley'
       AND ID.term = 'Database'
       AND IR.term = 'Research';

- Cosine similarity: sum over terms i of (qTermRank_i * DocTermRank_i). Note that the query's TF-IDF values are constants:

    SELECT docID,
           (Berkeley_tfidf * bTFIDF +
            Database_tfidf * dTFIDF +
            Research_tfidf * rTFIDF) AS magic_rank
      FROM BooleanResult
     ORDER BY magic_rank;

- We'll only rank Boolean results. Note this is just a heuristic! (Why? What's a fix? Is it feasible?)

Example postings with DocTermRanks:

    Berkeley          Database          Research
    docID  DTRank     docID  DTRank     docID  DTRank
    42     0.361      16     0.137      29     0.987
    49     0.126      49     0.654      49     0.876
    57     0.111      57     0.321      121    0.002

Ranking, cont.

- Recall: a merge join of the postings lists from each term, sorted by docID.
- While merging the postings lists, for each docID that matches on all terms (the Boolean part), compute the cosine distance to the query: i.e., over all terms, the sum of the products of the query term rank and DocTermRank.
- This "collapses" the view in the previous slide.
- Note that there's usually another join stage, with a Docs(docID, title, URL, crawldate, snippet) table:

    SELECT title, URL, crawldate, snippet,
           (Berkeley_tfidf * bTFIDF +
            Database_tfidf * dTFIDF +
            Research_tfidf * rTFIDF) AS magic_rank
      FROM BooleanResult, Docs
     WHERE BooleanResult.docID = Docs.docID
     ORDER BY magic_rank;

- Typically rank before the join with Docs: magic_rank is not an "interesting order," so use a fully parallel join with Docs (and/or you can replicate the Docs table).

Parallelizing

- Option 1: partition InvertedFile by docID; run a parallel top-k.
- Option 2: partition InvertedFile by term; run a distributed join, then top-k (parallel or not).
- Pros and cons? What are the relevant metrics?
- What's wrong with this picture?
- [Diagram: the Berkeley/Database/Research postings lists above flowing into top-k and join operators under each partitioning scheme.]

Quality of a Non-Boolean Answer

- Suppose only the top k answers are retrieved. Two common metrics:
  - Precision = |Correct AND Retrieved| / |Retrieved|
  - Recall    = |Correct AND Retrieved| / |Correct|
- Query: "The Who". How many matches?
- Our previous query plan: what is its ranking quality?

Phrase & Proximity Ranking

- One idea: index all 2-word runs in a doc ("bigrams"); can generalize to n-grams. Give higher rank to bigram matches.
- More generally, proximity matching (how many words/characters apart?):
  - Add a list-of-positions field to the inverted index.
  - The ranking function scans the two position lists to compute "proximate usage," and cooks this into the overall rank.

Some Additional Ranking Tricks

- Query expansion, suggestions: can do similarity lookups on terms to expand/modify people's queries.
- Fix misspellings, e.g. via an inverted index on q-grams of letters. The trigrams for "misspelling" are: mis, iss, ssp, spe, pel, ell, lli, lin, ing.
- Document expansion: can add terms to a doc before inserting it into the inverted file.

Hypertext Ranking

1. Everybody starts with weight 1.0.
2. Share your weight among all your outlinks.
3. Repeat…
- [Diagram: example link graph with node weights.]
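The inverted-file construction described at the top can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation; the function name and the toy documents are made up, and tokenization is naive whitespace splitting.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs maps docID -> text. Returns term -> postings list of
    (docID, term count) pairs, kept sorted by docID as the slides require."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive tokenizer for illustration
            counts[term][doc_id] += 1
    # Sorting by docID gives the "postings list" order used by the merge join.
    return {term: sorted(per_doc.items()) for term, per_doc in counts.items()}
```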
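The DocTermRank formula (TF x IDF) can be written directly; this also answers the slide's question about a term occurring in every doc: its IDF is log(1) = 0, so it contributes nothing to the rank. The function name is an assumption for illustration.

```python
import math

def doc_term_rank(tf, docs_with_term, total_docs):
    # TF x IDF: (#occurrences of t in d) * log(total #docs / #docs with t)
    return tf * math.log(total_docs / docs_with_term)
```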
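The length-normalization argument can be checked numerically: after dividing by Euclidean length, the dot product of two vectors is the cosine of the angle between them, and a long doc with the same term proportions as a short doc has cosine 1. A minimal sketch (helper names are made up):

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (a point on the unit sphere)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cosine(d1, d2):
    # Dot product of two unit-length vectors = cos(theta).
    return sum(a * b for a, b in zip(normalize(d1), normalize(d2)))
```

For unit vectors, squared Euclidean distance is 2 - 2*cos(theta), which is why cosine ranking and Euclidean-distance ranking agree after normalization.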
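The "collapse the view" step, where the merge join of docID-sorted postings lists computes magic_rank on the fly, can be sketched as follows, using the example postings table above. This is an illustrative sketch, not the course's plan operator; the function name is invented.

```python
# Postings sorted by docID, one list per query term (data from the table above).
berkeley = [(42, 0.361), (49, 0.126), (57, 0.111)]
database = [(16, 0.137), (49, 0.654), (57, 0.321)]
research = [(29, 0.987), (49, 0.876), (121, 0.002)]

def ranked_boolean_merge(postings, q_ranks):
    """Merge-join docID-sorted postings lists. For each docID present in
    every list (Boolean AND), yield (docID, sum of qTermRank * DocTermRank)."""
    iters = [iter(p) for p in postings]
    heads = [next(it, None) for it in iters]
    while all(h is not None for h in heads):
        top = max(h[0] for h in heads)
        # Advance every list that is behind the largest current docID.
        for i, h in enumerate(heads):
            while h is not None and h[0] < top:
                h = next(iters[i], None)
            heads[i] = h
        if any(h is None for h in heads):
            return          # some list is exhausted: no more Boolean matches
        if all(h[0] == top for h in heads):
            yield top, sum(q * h[1] for q, h in zip(q_ranks, heads))
            heads = [next(it, None) for it in iters]
```

With query-term ranks of 1.0, only docID 49 appears in all three lists, scoring 0.126 + 0.654 + 0.876.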
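The two retrieval-quality metrics translate directly into set arithmetic; a small sketch (function name and example sets are made up):

```python
def precision_recall(retrieved, correct):
    """Precision = |Correct AND Retrieved| / |Retrieved|
       Recall    = |Correct AND Retrieved| / |Correct|"""
    retrieved, correct = set(retrieved), set(correct)
    hits = len(retrieved & correct)
    return hits / len(retrieved), hits / len(correct)
```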
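The q-gram trick for misspellings relies on extracting overlapping letter runs; a one-liner reproduces the slide's trigram example for "misspelling":

```python
def qgrams(word, q=3):
    """All overlapping q-letter substrings of word, in order."""
    return [word[i:i + q] for i in range(len(word) - q + 1)]
```

A misspelled query term still shares most of its q-grams with the intended word, so an inverted index on q-grams can find likely corrections.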
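The three hypertext-ranking steps can be sketched as a fixed-point iteration. This is only the bare scheme from the slide, with no damping factor, and it assumes every page has at least one outlink; the graph and node names are made up.

```python
def hypertext_rank(outlinks, iterations=20):
    """outlinks maps page -> list of pages it links to.
    Step 1: everybody starts with weight 1.0.
    Steps 2-3: repeatedly share each page's weight equally among its outlinks."""
    rank = {p: 1.0 for p in outlinks}
    for _ in range(iterations):
        new = {p: 0.0 for p in outlinks}
        for p, links in outlinks.items():
            share = rank[p] / len(links)    # assumes len(links) > 0
            for q in links:
                new[q] += share
        rank = new
    return rank
```

Because every page redistributes all of its weight, the total weight in the graph is conserved; pages with many inlinks accumulate more of it.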

