http://www.cs.cmu.edu/~guestrin/Class/10701/

What's learning? Point Estimation
Machine Learning 10701/15781
Carlos Guestrin, Carnegie Mellon University
January 18th, 2005

Growth of Machine Learning
- Machine learning is the preferred approach to:
  - Speech recognition, natural language processing
  - Computer vision
  - Medical outcomes analysis
  - Robot control
- This trend is accelerating:
  - Improved machine learning algorithms
  - Improved data capture, networking, faster computers
  - Software too complex to write by hand
  - New sensors / IO devices
  - Demand for self-customization to user, environment

Syllabus
- Covers a wide range of machine learning techniques, from basic to state of the art.
- You will learn about the methods you heard about: Naïve Bayes, logistic regression, nearest neighbor, decision trees, boosting, neural nets, overfitting, regularization, dimensionality reduction, PCA, error bounds, VC dimension, SVMs, kernels, margin bounds, K-means, EM, mixture models, semi-supervised learning, HMMs, graphical models, active learning, reinforcement learning.
- Covers algorithms, theory, and applications.
- It's going to be fun and hard work.

Prerequisites
- Probabilities: distributions, densities, marginalization
- Basic statistics: moments, typical distributions, regression
- Algorithms: dynamic programming, basic data structures, complexity
- Programming: mostly your choice of language, but Matlab will be very useful
- We provide some background, but the class will be fast paced
- Ability to deal with abstract mathematical concepts

Review Sessions
- Very useful!
  - Review material
  - Present background
  - Answer questions
- Thursdays, 5:00-6:30 in Wean Hall 5409
- First recitation is tomorrow: review of probabilities
- Special recitation on Matlab: Jan. 25 (Wed.), 5:00-7:00pm, NSH 3305

Staff
- Four great TAs: a great resource for learning, interact with them!
  - Anton Chechetka, antonc@cs
  - Stanislav Funiak, sfuniak@cs
  - Andreas Krause, krausea@cs
  - Jure Leskovec, jure@cs
- Course General Czar: Terrill L. Frantz, TerrillFrantz@cmu
- Administrative Assistant: Monica Hopes, x8-5527, meh@cs

First Point of Contact for HWs
- To facilitate interaction, a TA will be assigned to each homework question. This will be your first point of contact for that question.
- But you can always ask any of us.
- For e-mailing instructors, always use 10701-instructors@cs.cmu.edu
- For announcements, subscribe to 10701-announce@cs: https://mailman.srv.cs.cmu.edu/mailman/listinfo/10701-announce

All textbooks are optional, but very useful
- Machine Learning, Tom Mitchell
- Pattern Classification (2nd Edition), Duda, Hart and Stork
- Neural Networks for Pattern Recognition, Chris Bishop

Grading
- 5 homeworks (30%): first one goes out 1/23
- Final project (20%): details out March 1st
- Midterm (20%): March 8th
- Final (30%): TBD by registrar

Homeworks
- Homeworks are hard, start early
- Due at the beginning of class
- 3 late days for the semester
- After late days are used up:
  - Half credit within 48 hours
  - Zero credit after 48 hours
- All homeworks must be handed in, even for zero credit
- Late homeworks are handed in to Monica Hopes, WEH 4616
- Collaboration:
  - You may discuss the questions
  - Each student writes their own answers
  - Write on your homework the names of anyone with whom you collaborated

Enjoy!
- ML is becoming ubiquitous in science, engineering and beyond
- This class should give you the basic foundation for applying ML and developing new methods
- The fun begins...

What is Machine Learning?

Machine Learning
- Study of algorithms that improve their performance at some task with experience

Object detection (Prof. H. Schneiderman)
- Example training images for each orientation

Text classification
- Company home page vs. personal home page vs. university home page vs. ...

Reading a noun (vs. verb) [Rustandi et al., 2005]

Modeling sensor data
[Figure: office/lab floor plan with numbered sensor locations]
- Measure temperatures at some locations
- Predict temperatures throughout the environment
[Guestrin et al. '04]

Learning to act
Reinforcement learning. An agent makes
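sensor observations, must select an action, and receives rewards: positive for "good" states, negative for "bad" states [Ng et al. '05].

Your first consulting job
- A billionaire from the suburbs of Seattle asks you a question:
  - He says: "I have a thumbtack. If I flip it, what's the probability it will fall with the nail up?"
  - You say: "Please flip it a few times."
  - You say: "The probability is ..."
  - He says: "Why?"
  - You say: "Because ..."

Thumbtack: Binomial Distribution
- $P(\text{Heads}) = \theta$, $P(\text{Tails}) = 1 - \theta$
- Flips are i.i.d.:
  - Independent events
  - Identically distributed according to the Binomial distribution
- Sequence $D$ of $\alpha_H$ Heads and $\alpha_T$ Tails: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$

Maximum Likelihood Estimation
- Data: observed set $D$ of $\alpha_H$ Heads and $\alpha_T$ Tails
- Hypothesis: Binomial distribution
- Learning $\theta$ is an optimization problem. What's the objective function?
- MLE: choose the $\theta$ that maximizes the probability of the observed data:
  $\hat{\theta} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \ln P(D \mid \theta)$

Your first learning algorithm
- Set the derivative to zero:
  $\frac{d}{d\theta} \ln P(D \mid \theta) = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{\alpha_H}{\alpha_H + \alpha_T}$

How many flips do I need?
- Billionaire says: "I flipped 3 heads and 2 tails."
- You say: "$\hat{\theta} = 3/5$, I can prove it!"
- He says: "What if I flipped 30 heads and 20 tails?"
- You say: "Same answer, I can prove it!"
- He says: "What's better?"
- You say: "Humm... The more the merrier?"
- He says: "Is this why I am paying you the big bucks?"

Simple bound (based on Hoeffding's inequality)
- For $N = \alpha_H + \alpha_T$ and $\hat{\theta} = \frac{\alpha_H}{\alpha_H + \alpha_T}$
- Let $\theta^*$ be the true parameter. For any $\epsilon > 0$:
  $P(|\hat{\theta} - \theta^*| \ge \epsilon) \le 2 e^{-2 N \epsilon^2}$

PAC Learning
- PAC: Probably Approximately Correct
- Billionaire says: "I want to know the thumbtack parameter $\theta$ within $\epsilon = 0.1$, with probability at least $1 - \delta = 0.95$. How many flips?"
- Requiring $2 e^{-2 N \epsilon^2} \le \delta$ gives $N \ge \frac{\ln(2/\delta)}{2 \epsilon^2}$, which is about 185 flips for these values.

What about prior?
- Billionaire says: "Wait, I know that the thumbtack is close to 50-50. What can you ..."
- You say: "I can learn it the Bayesian way."
- Rather than estimating a single $\theta$, we obtain a distribution over possible values of $\theta$

Bayesian Learning
- Use Bayes rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$
- Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$

Bayesian Learning for Thumbtack
- The likelihood function is simply Binomial: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
- What about the prior?
  - Represents expert knowledge
  - Simple posterior form
- Conjugate priors:
  - Closed-form representation of the posterior
  - For the Binomial, the conjugate prior is the Beta distribution

Beta prior distribution $P(\theta)$
- Prior: $P(\theta) = \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \text{Beta}(\beta_H, \beta_T)$
- Likelihood function: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
- Posterior: $P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1} (1 - \theta)^{\alpha_T + \beta_T - 1}$

Posterior distribution
- Prior: $\text{Beta}(\beta_H, \beta_T)$
- Data: $\alpha_H$ heads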
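  and $\alpha_T$ tails
- Posterior distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$, where $\alpha_H, \alpha_T$ are the observed head and tail counts and $\beta_H, \beta_T$ are the Beta prior parameters

Using the Bayesian posterior
- Posterior distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$
- Bayesian inference: no longer a single parameter:
  $E[f(\theta)] = \int_0^1 f(\theta)\, P(\theta \mid D)\, d\theta$
- The integral is often hard to compute

MAP: Maximum a posteriori approximation
- As more data is observed, the Beta is more certain
- MAP: use the most likely parameter:
  $\hat{\theta} = \arg\max_{\theta} P(\theta \mid D)$, so that $E[f(\theta)] \approx f(\hat{\theta})$

MAP for Beta distribution
- MAP: use the most likely parameter:
  $\hat{\theta} = \arg\max_{\theta} P(\theta \mid D) = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}$
- A Beta prior is equivalent to extra thumbtack flips
- As $N \to \infty$, the prior is "forgotten"
- But for small sample sizes, the prior is important!

What you need to know
- Go to the recitation on intro to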
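
The thumbtack estimators discussed in this lecture can be sketched in a few lines of Python. This is a minimal illustration, not code from the course. The function names are hypothetical. The sketch assumes the Hoeffding bound in its standard form, P(|theta_hat - theta*| >= eps) <= 2 exp(-2 N eps^2), and a Beta(beta_h, beta_t) prior whose posterior mode is taken as the MAP estimate.

```python
import math


def mle_theta(heads, tails):
    """MLE of P(heads): the fraction of observed flips that came up heads."""
    return heads / (heads + tails)


def hoeffding_flips(eps, delta):
    """Smallest N for which 2 * exp(-2 * N * eps**2) <= delta.

    Assumes the standard two-sided Hoeffding bound for i.i.d. flips.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def map_theta(heads, tails, beta_h, beta_t):
    """Mode of the posterior Beta(beta_h + heads, beta_t + tails).

    The closed form is valid when both posterior parameters exceed 1.
    """
    return (heads + beta_h - 1) / (heads + tails + beta_h + beta_t - 2)


if __name__ == "__main__":
    # 3 heads / 2 tails and 30 heads / 20 tails give the same MLE.
    print(mle_theta(3, 2), mle_theta(30, 20))

    # Flips needed for eps = 0.1 with probability at least 0.95 (delta = 0.05).
    print(hoeffding_flips(0.1, 0.05))  # 185

    # A prior worth roughly 100 extra flips centred on 50-50 (an assumed
    # example value) dominates 5 flips but is outweighed by 5000 flips.
    print(map_theta(3, 2, 50, 50))
    print(map_theta(3000, 2000, 50, 50))
```

With few flips the MAP estimate stays near the prior's 50-50 belief, while with many flips it approaches the MLE of 0.6, which matches the point that the prior is forgotten as N grows.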