
CS347 Review Slides: IR Part II
June 6, 2001
Prabhakar Raghavan

Overview of topics
  Clustering
    Agglomerative
    k-means
  Classification
    Rule-based
    Support Vector Machines
    Naive Bayes
  Finding communities (aka Trawling)
  Summarization
  Recommendation systems

Supervised vs. unsupervised learning
  Unsupervised learning
    Given a corpus, infer structure implicit in the docs, without prior training.
  Supervised learning
    Train the system to recognize docs of a certain type (e.g., docs in Italian, or docs about religion).
    Decide whether or not new docs belong to the class(es) trained on.

Why cluster documents
  Given a corpus, partition it into groups of related docs.
    Recursively, can induce a tree of topics.
  Given the set of docs from the results of a search (say, "jaguar"), partition into groups of related docs.
    Semantic disambiguation.

Agglomerative clustering
  Given a target number of clusters k.
  Initially, each doc is viewed as a cluster: start with n clusters.
  Repeat: while there are more than k clusters, find the closest pair of clusters and merge them.

k-means
  At the start of the iteration, we have k centroids.
    Need not be docs, just some k points.
    Axes could be terms, links, etc.
  Loop:
    Each doc is assigned to the nearest centroid.
    All docs assigned to the same centroid are averaged to compute a new centroid; thus we have k new centroids.

Classification
  Given one or more topics, decide which one(s) a given document belongs to.
  Applications:
    Classification into a topic taxonomy
    Intelligence analysts
    Routing email to help desks/customer service
  Choice of topic must be unique.

Accuracy measurement
  Confusion matrix: rows are the actual topic, columns are the topic assigned by the classifier.
  Example entry: 53. This (i, j) entry means 53 of the docs actually in topic i were put in topic j by the classifier.

Explicit queries
  Topic queries can be built up from other topic queries.

Classification by exemplary docs
  Feed the system exemplary docs on the topic (training).
    Positive as well as negative examples.
  The system builds its model of the topic.
  Subsequent test docs are evaluated against the model, which decides whether the test doc is a member of the topic.
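The k-means loop from the clustering notes above can be sketched as follows. This is a minimal illustration, not the course's implementation. The function names `kmeans` and `sq_dist`, the fixed iteration count, seeding the centroids with randomly chosen docs, and leaving an empty cluster's centroid unchanged are all assumptions made for the example. The notes only say the centroids are "some k points".

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(docs, k, iterations=10, seed=0):
    """Cluster doc vectors into k groups.

    docs: list of equal-length numeric vectors (axes could be terms, links, etc.).
    Returns (centroids, assignment), where assignment[i] is the centroid
    index chosen for docs[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen docs as the initial centroids.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Step 1: each doc is assigned to the nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(doc, centroids[c]))
            for doc in docs
        ]
        # Step 2: docs assigned to the same centroid are averaged
        # to compute the new centroid.
        for c in range(k):
            members = [doc for doc, a in zip(docs, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Calling `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)` should place the first two vectors in one cluster and the last two in the other. A fuller version would stop when the assignments no longer change rather than running a fixed number of iterations.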
Vector Spaces
  Each training doc is a point (vector) labeled by its topic.
  Hypothesis: docs of the same topic form a contiguous region of space.
  Define surfaces to delineate topics in space.

Support Vector Machine (SVM)
  Support vectors
  Quadratic programming problem.
  The decision function is fully specified by the training samples that lie on two parallel hyperplanes.
  Maximize the margin.

Naive Bayes
  Training
    Use class frequencies in the training data for Pr[c_i].
    Estimate word frequencies for each word and each class to estimate Pr[w | c_i].
  Test doc d
    Use the Pr[w | c_i] values to estimate Pr[d | c_i] for each class c_i.
    Determine the class c_j for which Pr[c_j | d] is maximized.

Content + neighbors' classes
  Naive Bayes gives Pr[c_j | d] based on the words in d.
  Now consider Pr[c_j | N], where N is the set of labels of d's neighbors.
    Can separate N into in- and out-neighbors.
  Can combine conditional probs for c_j from text- and link-based evidence.

Finding communities on the web
  Not easy, since the web is huge.
  What is a "dense subgraph"?
  Define an (i, j)-core: a complete bipartite subgraph with i nodes, all of which point to each of j others.
  [Figure: a (2, 3) core, with the nodes labeled Fans and Centers]

Document Summarization
  Lexical chains: look for terms appearing in consecutive sentences.
  For each sentence S in the doc:
    f(S) = a*h(S) - b*t(S)
    where h(S) = total score of all chains starting at S
    and t(S) = total score of all chains covering S, but not starting at S.

Recommendation Systems
  Recommend docs to a user based on the user's context (besides the docs' content).
  Other applications:
    Re-rank search results
    Locate experts
    Targeted ads
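The Naive Bayes training and test steps from the notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. The function names `train_naive_bayes` and `classify`, the add-one smoothing of the word estimates, the use of log probabilities, and skipping words never seen in training are all choices made for the example and do not come from the slides.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, class_label) pairs.

    Returns (priors, cond) where priors[c] estimates Pr[c] and
    cond[c][w] estimates Pr[w | c].
    """
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)

    # Class frequencies in the training data give Pr[c_i].
    n_docs = len(labeled_docs)
    priors = {c: class_counts[c] / n_docs for c in class_counts}

    # Word frequencies per class give Pr[w | c_i].
    cond = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Assumption: add-one smoothing, so unseen (word, class) pairs
        # do not get probability zero.
        cond[c] = {
            w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab
        }
    return priors, cond


def classify(words, priors, cond):
    """Return the class c_j maximizing Pr[c_j] * product of Pr[w | c_j]."""
    scores = {}
    for c in priors:
        # Sum logs instead of multiplying to avoid floating-point underflow.
        score = math.log(priors[c])
        for w in words:
            if w in cond[c]:  # assumption: words unseen in training are skipped
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```

Maximizing Pr[c_j] times the product of the Pr[w | c_j] values is equivalent to maximizing Pr[c_j | d], because the denominator Pr[d] in Bayes' rule is the same for every class.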



Stanford CS 347 - Lecture Notes
