Introduction to Machine Learning
Xiaojin Zhu
[email protected]

Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers, 2009.
http://www.morganclaypool.com/doi/abs/10.2200/S00196ED1V01Y200906AIM006
(download from UW computers)

Outline
• Representing “things”
  – Feature vector
  – Training sample
• Unsupervised learning
  – Clustering
• Supervised learning
  – Classification
  – Regression

Little green men
• The weight and height of 100 little green men
• What can you learn from this data?

A less alien example
• From Iain Murray http://homepages.inf.ed.ac.uk/imurray2/

Representing “things” in machine learning
• An instance x represents a specific object (“thing”)
• x is often represented by a D-dimensional feature vector x = (x1, . . . , xD) ∈ R^D
• Each dimension is called a feature. Features can be continuous or discrete.
• x is a point in the D-dimensional feature space
• The feature vector is an abstraction of the object; it ignores all other aspects (two men with the same weight and height map to the identical instance)

Feature representation example
• Text document
  – Vocabulary of size D (~100,000): “aardvark … zulu”
  – “bag of words”: counts of each vocabulary entry
    • To marry my true love (3531:1 13788:1 19676:1)
    • I wish that I find my soulmate this year (3819:1 13448:1 19450:1 20514:1)
  – Often remove stopwords: the, of, at, in, …
  – A special “out-of-vocabulary” (OOV) entry catches all unknown words

More feature representations
• Image
  – Color histogram
• Software
  – Execution profile: the number of times each line is executed
• Bank account
  – Credit rating, balance, #deposits in last day, week, month, year, #withdrawals, …
• You and me
  – Medical test1, test2, test3, …

Training sample
• A training sample is a collection of instances x1, . . . , xn, which is the input to the learning process.
• xi = (xi1, . . . , xiD)
• Assume these instances are sampled independently from an unknown (population) distribution P(x)
• We denote this by xi ~ P(x) i.i.d., where i.i.d. stands for independent and identically distributed

Training sample
• A training sample is the “experience” given to a learning algorithm
• What the algorithm can learn from it varies
• We introduce two basic learning paradigms:
  – unsupervised learning
  – supervised learning

UNSUPERVISED LEARNING
No teacher.

Unsupervised learning
• Training sample x1, . . . , xn, and that’s it
• No teacher providing supervision as to how individual instances should be handled
• Common tasks:
  – clustering: separate the n instances into groups
  – novelty detection: find instances that are very different from the rest
  – dimensionality reduction: represent each instance with a lower-dimensional feature vector while maintaining key characteristics of the training sample

Clustering
• Group the training sample into k clusters
• How many clusters do you see?
• Many clustering algorithms
  – HAC
  – k-means
  – …

Example 1: music island
• Organizing and visualizing a music collection
• CoMIRVA http://www.cp.jku.at/comirva/

Example 2: Google News

Example 3: your digital photo collection
• You probably have >1000 digital photos, ‘neatly’ stored in various folders…
• After this class you’ll be able to organize them better
  – Simplest idea: cluster them using image creation time (EXIF tag), as in the sketch below
  – More complicated: extract image features
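A minimal sketch of the “cluster by EXIF creation time” idea, assuming a simple time-gap rule: photos whose timestamps are close together go in the same cluster. The two-hour threshold and the example timestamps are illustrative assumptions, not something prescribed by the slides.

```python
from datetime import datetime, timedelta

def cluster_by_time(timestamps, gap=timedelta(hours=2)):
    """Group photo timestamps into 'events': start a new cluster
    whenever the gap to the previous photo exceeds `gap`."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= gap:
            clusters[-1].append(t)   # close to the previous photo: same event
        else:
            clusters.append([t])     # gap too large: start a new event
    return clusters

# Hypothetical EXIF creation times for six photos taken on one day
times = [datetime(2009, 7, 4, h, m) for h, m in
         [(10, 0), (10, 5), (10, 20), (15, 30), (15, 45), (22, 0)]]
print([len(c) for c in cluster_by_time(times)])   # -> [3, 2, 1]
```

This one-dimensional grouping needs no feature vector at all; extracting image features (color histograms, etc.) would feed a general-purpose clustering algorithm like the two discussed next.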
Two most frequently used methods
• There are many clustering algorithms. We’ll look at the two most frequently used ones:
  – Hierarchical clustering, where we build a binary tree over the dataset
  – K-means clustering, where we specify the desired number of clusters and use an iterative algorithm to find them

Hierarchical clustering
• Very popular clustering algorithm
• Input:
  – A dataset x1, …, xn, where each point is a numerical feature vector
  – Does NOT need the number of clusters

Hierarchical Agglomerative Clustering
• Uses the Euclidean (L2) distance between points: d(x, x') = sqrt(Σ_d (x_d − x'_d)^2)

Hierarchical clustering
• Initially every point is in its own cluster
• Find the pair of clusters that are the closest
• Merge the two into a single cluster
• Repeat… until the whole dataset is one giant cluster
• You get a binary tree (not shown here)

Hierarchical clustering
• How do you measure the closeness between two clusters? At least three ways:
  – Single-linkage: the shortest distance from any member of one cluster to any member of the other cluster, d(A, B) = min over x in A, x' in B of d(x, x')
  – Complete-linkage: the greatest distance from any member of one cluster to any member of the other cluster
  – Average-linkage: you guessed it, the average distance over all pairs of members

Hierarchical clustering
• The binary tree you get is often called a dendrogram, a taxonomy, or a hierarchy of data points
• The tree can be cut at various levels to produce different numbers of clusters: if you want k clusters, just cut the (k−1) longest links
• Sometimes the hierarchy itself is more interesting than the clusters
• However, there is not much theoretical justification for it…

Advanced topics
• Constrained clustering: what if an expert looks at the data and tells you
  – “I think x1 and x2 must be in the same cluster” (must-links)
  – “I think x3 and x4 cannot be in the same cluster” (cannot-links)
• This is clustering with supervised information (must-links and cannot-links). We can
  – change the clustering algorithm to fit the constraints, or
  – learn a better distance measure
• See the book Constrained Clustering: Advances in Algorithms, Theory, and Applications. Editors: Sugato Basu, Ian Davidson, and Kiri Wagstaff. http://www.wkiri.com/conscluster/

K-means clustering
• Very popular clustering method
• Don’t confuse it with the k-NN classifier
• Input:
  – A dataset x1, …, xn, where each point is a numerical feature vector
  – Assume the number of clusters, k, is given

K-means clustering
• The dataset; the input is k=5
• Randomly pick 5 positions as initial cluster centers (not necessarily data points)
• Each point finds which cluster center it is closest to (very much like 1NN); the point belongs to that cluster
• Each cluster center is then moved to the mean of the points assigned to it, and the assignment and update steps repeat until nothing changes (see the sketch below)
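A minimal NumPy sketch of this assign-then-update loop. Initializing the centers from randomly chosen data points (rather than arbitrary random positions, as in the slides' walkthrough) is an assumption here; both initializations are common.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point (like 1NN)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of the points in its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged: assignments stable
            break
        centers = new_centers
    return labels, centers

# Toy 2-D data: two well-separated blobs, clustered with k=2
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)
```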
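As a companion to the k-means sketch, here is a minimal sketch of the hierarchical agglomerative clustering procedure described earlier. The choice of SciPy's linkage/fcluster utilities and the toy (weight, height) data, in the spirit of the little green men example, are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: little green men as (weight, height) points, two groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([60, 150], 5, (20, 2)),
               rng.normal([90, 180], 5, (20, 2))])

# Build the binary tree (dendrogram) bottom-up, merging the closest
# clusters first; 'single' linkage uses the shortest pairwise distance
Z = linkage(X, method='single', metric='euclidean')

# Cut the tree to obtain k = 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```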