Categorizing Data Vectors Types of Categorization Basic Classifiers Finding Simple Clusters in Data 36 350 Data Mining 11 September 2009 Reading Textbook sections 9 3 9 5 Contents 1 Categorization and Classifiers 1 2 Why Cluster 2 1 Good clusters 2 3 3 The k means algorithm 3 1 Geometry of k Means 3 2 k Means as Search Algorithm 4 5 6 A Geometric Aspects of the Basic Classifiers A 1 Prototype Method A 1 1 Rate of Convergence A 2 Nearest Neighbor Method 7 7 8 8 1 Categorization and Classifiers Dividing data into discrete categories is one of the most common kinds of datamining task Often the categories are things which are given to us in advance by some kind of background knowledge cells cancerous or not or the kind of decision we are going to make credit applicant will they pay back the loan or not or simply by some taxonomy which our institution has decided to use text politics or religion automobile or motorcycle pictures flower tiger or ocean This case of assigning new data to pre existing categories is called classification and the categories are called classes When we have some examples or training data which have been labeled by someone who knew what they were doing we have a supervised learning 1 problem The point of classification methods is to accurately assign new unlabeled examples from the test data to these classes This is supervised learning because we can check the performance on the labeled training data The point of calculating information was to select features which made classification easier We have already seen two algorithms for classification 1 In nearest neighbor classification we assign each new data point to the same class as the closest labeled vector or exemplar or example to be slightly less fancy This uses lots of memory because we need to keep track of many vectors and time because we need to calculate lots of distances but assumes next to nothing about the geometry of the classes 2 In prototype classification we represent each class by a single vector its prototype and assign new data to the class whose prototype is closest This uses little memory or computation time but implicitly assumes that each class forms a compact in fact convex region in the feature space The appendix to these notes says more about the geometry of these methods and how it relates to their performance as classifiers A more general theme however is that statistical and data mining problems usually have a trade off between methods which make strong assumptions and deliver good results when the assumptions hold but break otherwise and methods which make weak assumptions and work more broadly One manifestation of this very much displayed by prototype vs nearest neighbor classifiers is the trade off between precision and accuracy We will say much more about this when we look at how to evaluate and compare predictive models 2 Why Cluster Classification depends on having both known categories and labeled examples of the categories If there are known categories but no labeled examples we may be able to do some kind of query feedback reinforcement or learning if we can check guesses about category membership Rocchio s algorithm takes feedback from a user and learns to classify search results as relevant or not relevant But we might not have known classes to start with In these unsupervised situation one thing we can try to do is to discover categories which are in implicit in the data themselves These categories are called clusters rather than classes and finding them is the subject of clustering or cluster analysis See Table 1 Even if our data comes to us with class labels attached it s often wise to be skeptical of their use Official classification schemes are often the end result of a mix of practical experience intuition old theories prejudice ideas copied from somewhere else old notions enshrined in forms databases and software compromises among groups which differ in interests ideas and clout preferences for simple rules and people making stuff up because they need something by 2 Known classes Yes Yes Yes No Class labels Given for training data Given for some but not all training data Hints feedback None Type of learning problem Classification supervised learning Semi supervised learning Feedback or reinforcement learning Clustering unsupervised learning Table 1 Partial taxonomy of learning problems deadline For instance the threshold for being overweight according to official medical statistics is a body mass index1 of 25 and obesity means a BMI of 30 or more not because there s a big qualitative difference between a BMI of 29 9 and one of 30 1 but just because they needed some break Moreover once a scheme gets established organizational inertia can keep it in place long after whatever relevance it once had has eroded away The Census Bureau set up a classification scheme for different jobs and industries in the 1930s so that for several decades there was one class of jobs in electronics including all of the computer industry The point being even when you have what you are told is a supervised learning problem with labeled data it can be worth treating it as unsupervised learning problem If the clusters you find match the official classes that s a sign that the official scheme is actually reasonable and relevant if they disagree you have to decide whether to trust the official story or your cluster finding method 2 1 Good clusters A good way to start thinking about how to cluster our data is to ask ourselves what properties we want in clusters First of all clusters like classes should partition the data every possible object should belong to one and only one cluster Beyond that it would be good if knowing which cluster an object belonged to told us by itself a lot about that object s properties In other words we would like the expected information in the cluster about the features to be large If the features are X1 X2 Xp and the cluster is C we would ideally maximize I X1 X2 Xp C Actually doing this maximization turns out to be very hard However we can say some things about what the maximally informative clusters would look like and use these properties to guide our search A high information value for the clusters means that knowing the cluster reduces our uncertainty about the features All else being equal this means that the objects in a cluster should be similar to each other or form a compact set of points in feature vector space Again all else being equal different
View Full Document
Unlocking...