CS347 Lecture 10 May 14 2001 Prabhakar Raghavan Topics du jour Centroid nearest neighbor classification Bayesian Classification Link based classification Document summarization Centroid NN Given training docs for a topic compute their centroid Now have a centroid for each topic Given query doc assign to topic whose centroid is nearest Example Government Science Arts Bayesian Classification As before train classifier on exemplary docs from classes c c c Given test doc d estimate 1 2 r Pr d belongs to class cj Pr cj d Apply Bayes Theorem Pr cj d Pr d Pr d cj Pr cj So Pr d cj Pr cj Pr cj d Pr d Express Pr d as r i 1 Pr d ci Pr ci Reverse Engineering To compute Pr cj d all we need are Pr d ci and Pr ci for all i Will get these from training Training Given a set of training docs together with a class label for each training doc e g these docs belong to Physics those others to Astronomy etc Estimating Pr ci Pr ci Fraction of training docs that are labeled ci In practice use more sophisticated smoothing to boost probabilities of classes under represented in sample Estimating Pr d ci Basic assumption each occurrence of each word in each doc is independent of all others For a word w from sample docs Pr w ci Frequency of word w amongst all docs labeled ci Pr d ci Pr w ci w d Example Thus the probability of a doc consisting of Friends Romans Countrymen Pr Friends Pr Romans Pr Countrymen In implementations pay attention to precision underflow Extract all probabilities from term doc matrix To summarize Training Use class frequencies in training data for P r ci Estimate word frequencies for each word and each class to estimate Pr w ci Test doc d Use the Pr w ci values to estimate Pr d ci for each class ci Determine class cj for which Pr cj d is maximized Abstract features So far have relied on word counts as the features to train and classify on In general could be any statistic terms in boldface count for more authors of cited docs number of equations square of the number of commas Abstract features Bayesian in practice Many improvements used over na ve version discussed above various models for document generation varying emphasis on words in different portions of docs smoothing statistics for infrequent terms classifying into a hierarchy Supervised learning deployment issues Uniformity of docs in training test Quality of authorship Volume of training data Typical empirical observations Training 1000 docs class Accuracy upto 90 in the very best circumstances below 50 in the worst SVM vs Bayesian SVM appears to beat variety of Bayesian approaches both beat centroid based methods SVM needs quadratic programming more computation than na ve Bayes at training time less at classification time Bayesian classifiers also partition vector space but not using linear decision surfaces Classifying hypertext Given a set of hyperlinked docs Class labels for some docs available Figure out class labels for remaining docs Example c1 c3 c3 c2 c2 c4 Bayesian hypertext classification Besides the terms in a doc derive cues from linked docs to assign a class to test doc Cues could be any abstract features from doc and its neighbors Feature selection Attempt 1 use terms in doc those in its neighbors Generally does worse than terms in doc alone Why Neighbors terms diffuse focus of doc s terms Attempt 2 Use terms in doc plus tagged terms from neighbors E g car denotes a term occurring in d car I denotes a term occurring in a doc with a link into d car O denotes a term occurring in a doc with a link from d Generalizations possible car OIOI Attempt 2 also fails Key terms lose density e g car gets split into car car I car O Better attempt Use class labels of in and out neighbors as features in classifying d e g docs about physics point to docs about physics Setting some neighbors have pre assigned labels need to figure out the rest Content neighbors classes Na ve Bayes gives Pr cj d based on the words in d Now consider Pr cj N where N is the set of labels of d s neighbors Can separate N into in and out neighbors Can combine conditional probs for cj from text and link based evidence Training As before use training data to compute Pr N cj etc Assume labels of d s neighbors independent as we did with word occurrences Also continue to assume word occurrences within d are independent Classification Can invert probs using Bayes to derive Pr cj N Need to know class labels for all of d s neighbors Unknown neighbor labels What if all neighbors class labels are not known First use word content alone to assign a tentative class label to each unlabelled doc Next iteratively recompute all tentative labels using word content as well as neighbors classes some tentative Convergence This iterative relabeling will converge provided tentative labels not too far off Guarantee requires ideas from Markov random fields used in computer vision Error rates significantly below text alone classification End of classification Move on to document summarization Document summarization Given a doc produce a short summary Length of summary a parameter Application form factor Very sensitive to doc quality Typically corpus independent Summarization Simplest algorithm Output the first 50 or however many words of the doc Hard to beat on high quality docs For very short summaries e g 5 words drop stop words Summarization Slightly more complex heuristics Compute an importance score for each sentence Summary output contains sentences from original doc beginning with the most important Article from WSJ Example 1 Tandon Corp will introduce a portable hard disk drive today that will enable personal computer owners to put all of their programs and data on a single transportable cartridge that can be plugged into other computers 2 Tandon which has reported big losses in recent quarters as it shifted its product emphasis from disk drives to personal computer systems asserts that the new device called Personal Data Pac could change the way software is sold and even the way computers are used 3 The company based in Moorpark Calif also will unveil a new personal computer compatible with International Business Machines Corp s PC AT advanced personal computer that incorporates the new portable hard disks 4 It s an idea we ve been working on for several years said Chuck Peddle president of Tandon s computer systems division 5 As the price of hard disks kept coming down we realized that if we could make them portable the floppy disk drive would be a useless accessory 6
View Full Document