Handout 2 More Similarity Searching Multidimensional Scaling 36 350 Data Mining August 27 2008 Reading Principles of Data Mining sections 14 1 14 4 skiping 14 3 3 for now and 3 7 Let s recap similarity searching for documents We represent each document as a bag of words i e a vector giving the number of times each word occurred in the document This abstracts away all the grammatical structure context etc leaving us with a matrix whose rows are feature vectors a data frame To find documents which are similar to a given document Q we calculate the distance between Q and all the other documents i e the distance between their feature vectors We then return the k closest documents Today we re going to look at some wrinkles and extensions Stemming It is a lot easier to decide what counts as a word in English than in some other languages 1 Even so we need to decide whether car and cars are the same word for our purposes or not Stemming takes derived forms of words like cars flying and reduces them to their stem car fly Doing this well requires linguistic knowledge so the system doesn t think the stem of potatoes is potatoe or that gravity is the same as grave and it can even be harmful if the document has Saturns plural it s most likely about the cars Multidimensional Scaling The bag of words vectors representing our documents generally live in spaces with lots of dimensions certainly more than three 1 For example Turkish is what is known as an aggulutinative language in which grammatical units are glued together to form compound words whose meaning would be a whole phrase or sentence in English e g gelemiyebelirim I may be unable to come yapabilecekdiyseniz if you were going to be able to do or calistirilmamaliymis supposedly he ought not to be made to work German does this too but not so much This causes problems with Turkish language applications because many sequences of letters separated by punctuation are effectively unique See for example L O zgu r T Gu ngo r and F Gu rgen Adaptive antispam filtering for agglutinative languages a special case for Turkish Pattern Recognition Letters 25 2004 1819 1831 available from http www cmpe boun edu tr gungort 1 which are hard for ordinary humans to visualize However we can compute the distance between any two vectors so we know how far apart they are Multidimensional scaling MDS is the general name for a family of algorithms which take high dimensional vectors and map them down to two or three dimensional vectors trying to preserve all the relevant distances Abstractly the idea is that we start with vectors v1 v2 vn in a p dimensional space where p is large and we want to find new vectors x1 x2 xn in R2 or R3 such that n X X 2 v1 v2 d x1 x2 i 1 j6 i is as small as possible where is distance in the original space and d is Euclidean distance in the new space Note that the new or image points xi are representations of the vi i e representations of representations There is some trickiness to properly minimizing this objective function for instance if we rotate all the xi through a common angle their distances are unchanged but it s not really a new solution and it s not usually possible to make it exactly zero See Sec 3 7 in the textbook for details We will see a lot of multidimensional scaling plots because they are nice visualization tools but we will also see a lot of other data reduction or dimensionality reduction methods because sometimes it s more important to preserve other properties than distances Classification One very important data mining task is classifying new pieces of data that is assigning them to one of a fixed number of classes Last time our two classes were about automobiles and about motorcycles Usually new data doesn t come with a class label so we have to somehow guess the class from the features 2 With a nearest neighbor strategy we guess that the new object is in the same class as the closest already classified object We saw this at the end of the last lecture With a prototype strategy we pick out the most representative member of each class or perhaps the average of each class as its prototype and guess that new objects belong to the class with the closer prototype We will see many other classifier rules in addition to these two but these are ones we can apply as soon as we know how to calculate distance Queries Are Documents I promised that we could avoid having to come up with an initial document The trick to this is to realize that a query whether an actual sentence What are the common problems of the 2001 model year Saturn or just a list of key words problems 2001 model Saturn is a small document If we represent user queries as bags of words we can use our similarity searching procedure on them If this works we have a search technique which find mostly relevant things the precision is high and most relevant items are found the recall is high 2 If it does come with a label we read the label 2 Normalization None Document length Euclidean length Equal weight 83 63 59 IDF weight 79 60 21 Table 1 Number of mis classifications in a larger 199 document collection of posts from rec auto and rec motorcycles for different normalizations of Euclidean distance with and without IDF weighting Classification is by the nearest neighbor method Inverse Document Frequency IDF Weighting We are using features word counts to identify documents which are relevant to our query Not all features are going to be equally useful Some words are so common that they give us almost no ability at all to discriminate between relevant and irrelevant documents In most collections of English documents looking at the of a etc is a waste of time We could handle this by a fixed list of stop words which we just don t count but this at once too crude all or nothing and too much work we need to think up the list Inverse document frequency IDF is a more adaptive approach The document frequency of a w is the number of documents it appears in nw The IDF weight of w is N IDF w log nw where N is the total size of our collection Now when we make our bag ofwords vector for the document Q the number of times w appears in Q Qw is multiplied by IDF w Notice that if w appears in every document nw N and it gets an IDF weight of zero we won t use it to calculate distances This takes care of most of the things we d use a list of stop words for but it also takes into account implicitly the kind of documents we re using In a data base of papers on genetics gene …
View Full Document
Unlocking...