Notes and Announcements

• Midterm exam: Oct 20, Wednesday, in class
• Late homeworks
  – Turn in hardcopies to Michelle.
  – DO NOT ask Michelle for extensions.
  – Note down the date and time of submission.
  – If submitting softcopy, email to the 10-701 instructors list.
  – Software needs to be submitted via Blackboard.
• HW2 out today – watch email

Projects

Hands-on experience with machine learning algorithms – understand when they work and fail, develop new ones!
Project ideas are online; discuss with the TAs – every project must have a TA mentor.
• Proposal (10%): Oct 11
• Mid-term report (25%): Nov 8
• Poster presentation (20%): Dec 2, 3-6 pm, NSH Atrium
• Final project report (45%): Dec 6

Project Proposal

• Proposal (10%): Oct 11
  – 1 page maximum
  – Describe the data set
  – Project idea (approx. two paragraphs)
  – Software you will need to write
  – 1-3 relevant papers; read at least one before submitting your proposal
  – Teammate (maximum team size is 2) and division of work
  – Project milestone for the mid-term report? Include experimental results.

Recitation Tomorrow!

• Linear & non-linear regression, nonparametric methods
• Strongly recommended!!
• Place: NSH 1507 (note)
• Time: 5-6 pm
TK

Non-parametric methods
Kernel density estimate, kNN classifier, kernel regression

Aarti Singh
Machine Learning 10-701/15-781
Sept 29, 2010

Parametric methods

• Assume some functional form (Gaussian, Bernoulli, Multinomial, logistic, linear) for
  – P(Xi|Y) and P(Y), as in Naïve Bayes
  – P(Y|X), as in logistic regression
• Estimate parameters (μ, σ², θ, w, β) using MLE/MAP and plug in
• Pro – need few data points to learn parameters
• Con – strong distributional assumptions, not satisfied in practice

Example

[Figure: hand-written digit images projected as points on a two-dimensional (nonlinear) feature space]

Non-Parametric methods

• Typically don't make any distributional assumptions
• As we have more data, we should be able to learn more complex models
• Let the number of parameters scale with the number of training data points
• Today, we will see some nonparametric methods for:
  – Density estimation
  – Classification
  – Regression

Histogram density estimate

Partition the feature space into distinct bins with widths Δi and count the number of observations, ni, in each bin:

  p̂(x) = ni / (n Δi)   for x in bin i

• Often, the same width is used for all bins, Δi = Δ.
• Δ acts as a smoothing parameter.
Image src: Bishop book

Effect of histogram bin width

• # bins = 1/Δ
• Assuming the density is roughly constant in each bin (holds true if Δ is small)
• Bias of the histogram density estimate at x:
  E[p̂(x)] − p(x) = (1/Δi) ∫bin i p(u) du − p(x) ≈ 0 when p is roughly constant over the bin

Bias – Variance tradeoff

• Choice of # bins
• Bias – how close the mean of the estimate is to the truth
• Variance – how much the estimate varies around its mean
• Small Δ, large # bins: "small bias, large variance"
• Large Δ, small # bins: "large bias, small variance"

Bias-Variance tradeoff

• # bins = 1/Δ
• Small Δ: p(x) approx. constant per bin (small bias)
• Large Δ: more data per bin, stable estimate (small variance)

Choice of # bins

• # bins = 1/Δ
• For fixed n: as Δ decreases, ni decreases
• MSE = Bias² + Variance
Image src: Bishop book; Larry's book

Histogram as MLE

• Class of density estimates – constant on each bin
• Parameters pj = density in bin j
• Note: since the estimate must integrate to 1, Σj pj Δj = 1
• Maximize the likelihood of the data under the probability model with parameters pj
• Show that the histogram density estimate is the MLE under this model – HW/Recitation

Kernel density estimate

• Histogram – blocky estimate
• Kernel density estimate, aka "Parzen/moving window method"
[Figure: histogram vs. kernel density estimate of the same data]

Kernel density estimate

• More generally, for a kernel function K and bandwidth h:
  p̂(x) = (1/(n h)) Σi K((x − Xi)/h)

Kernel density estimation

[Figure: Gaussian bumps (red) around six data points and their sum (blue)]
• Place small "bumps" at each data point, determined by the kernel function.
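The two density estimators above can be sketched in a few lines of code. This is a minimal 1-D illustration, not from the lecture: the function names and the choice of a Gaussian kernel for the KDE are assumptions made here for concreteness.

```python
import numpy as np

def histogram_density(x, data, bin_width):
    """Histogram density estimate: p_hat(x) = n_i / (n * bin_width),
    where n_i counts the training points in the bin containing x.
    Bins are anchored at 0 for simplicity (an illustrative choice)."""
    n = len(data)
    bin_idx = np.floor(x / bin_width)            # which bin x falls into
    in_bin = np.floor(data / bin_width) == bin_idx
    return in_bin.sum() / (n * bin_width)

def kernel_density(x, data, h):
    """Gaussian-kernel (Parzen window) estimate:
    p_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h), K = standard normal pdf."""
    u = (x - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum() / (len(data) * h)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=1000)         # training data from N(0,1)
print(histogram_density(0.0, sample, bin_width=0.5))  # near N(0,1) density at 0
print(kernel_density(0.0, sample, h=0.3))
```

Both estimates should come out near the true value p(0) ≈ 0.40; shrinking `bin_width` or `h` makes the estimates noisier (variance), while growing them over-smooths (bias), exactly the tradeoff discussed above.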
• The estimator consists of a (normalized) "sum of bumps".
• Note that where the points are denser, the density estimate will have higher values.
Img src: Wikipedia

Kernels

• Any kernel function that satisfies
  K(x) ≥ 0 and ∫ K(x) dx = 1

Kernels

• Finite support – only need local points to compute the estimate
• Infinite support – need all points to compute the estimate
  – But quite popular since smoother (10-702)

Choice of kernel bandwidth

[Figure: the "Bart-Simpson" density estimated with bandwidth too small, too large, and just right. Image source: Larry's book – All of Nonparametric Statistics]

Histograms vs. Kernel density estimation

• Δ = h acts as a smoother.
• Bias-variance tradeoff
• Simulations

k-NN (Nearest Neighbor) density estimation

• Histogram / kernel density estimate: fix Δ, estimate the number of points within Δ of x (ni or nx) from the data
• k-NN density estimate: fix nx = k, estimate Δ from the data (the volume of the ball around x that contains the k nearest training points)

k-NN density estimation

• k acts as a smoother.
• Not very popular for density estimation – expensive to compute, bad estimates
• But a related version for classification is quite popular…

From Density estimation to Classification

k-NN classifier

[Figure: training documents from three classes – Sports, Science, Arts]

k-NN classifier

[Figure: a test document among the Sports, Science, Arts training documents]

k-NN classifier (k=4)

[Figure: a test document with its 4 nearest training documents]
• What should we predict? … Average? Majority?
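The majority-vote rule hinted at above can be sketched directly. This is a toy illustration, not the lecture's code: the Euclidean distance, the tiny 2-D data set, and the class names reused from the slides are all assumptions made here.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=4):
    """Majority-vote k-NN: find the k closest training points to x
    and return the most common label among them."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D "document" features: two well-separated classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array(["Sports", "Sports", "Sports",
              "Science", "Science", "Science"])
print(knn_predict(np.array([0.1, 0.1]), X, y, k=3))  # → Sports
```

Note that no model is fit ahead of time: all work happens at prediction, which is why k-NN methods can be expensive at test time.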
• Why?

k-NN classifier

• Optimal classifier:
  f*(x) = argmaxy P(y) P(x|y)
• k-NN classifier (majority vote): with ny = # total training pts of class y, and ky = # training pts of class y that lie within the Δk,x ball,
  P(y) ≈ ny/n and P(x|y) ≈ ky / (ny · vol(Δk,x ball))  ⇒  predict argmaxy ky

1-Nearest Neighbor (kNN) classifier

[Figure: 1-NN prediction among Sports, Science, Arts documents]

2-Nearest Neighbor (kNN) classifier

[Figure: 2-NN prediction]
• Even K is not used in practice.

3-Nearest Neighbor (kNN) classifier

[Figure: 3-NN prediction]

5-Nearest Neighbor (kNN) classifier

[Figure: 5-NN prediction]

What is the best K?

• Bias-variance tradeoff
• Larger K => predicted label is more stable
• Smaller K => predicted label is more accurate
• Similar to density estimation
• Choice of K – in the next class …

1-NN classifier – decision boundary

[Figure: Voronoi diagram induced by K = 1]

k-NN classifier – decision boundary

• K acts as a smoother (bias-variance tradeoff)
• Guarantee: as n → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error.

Case Study: kNN for Web Classification

• Dataset – 20 News Groups (20 classes)
  – Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
  – 61,118 words, 18,774 documents
  – Class label descriptions

Experimental Setup

• Training/Test sets:
  – 50%-50% randomly split
  – 10 runs
  – report average results
• Evaluation criterion: classification accuracy

Results: Binary Classes

[Figure: accuracy vs. k for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles]

From Classification to Regression

Temperature sensing

• What is the temperature in the room?
• Average of all sensor readings
• "Local" average at location x
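The "local average at location x" idea is kernel regression: weight each sensor reading by how close that sensor is to x. A minimal 1-D sketch follows, assuming a Gaussian weighting kernel and made-up sensor data; the function name and numbers are illustrative, not from the lecture.

```python
import numpy as np

def kernel_regression(x, X_train, y_train, h=0.5):
    """Nadaraya-Watson style estimate: a weighted local average of the
    training responses, with Gaussian weights K((x - X_i)/h)."""
    u = (x - X_train) / h
    w = np.exp(-0.5 * u**2)          # Gaussian kernel weights
    return (w * y_train).sum() / w.sum()

# Toy sensor readings: temperature rises linearly across the room
locations = np.linspace(0.0, 10.0, 21)
temps = 18.0 + 0.5 * locations       # assumed "true" temperature profile
print(kernel_regression(4.0, locations, temps, h=0.5))  # ≈ 20.0
```

With h → ∞ every weight tends to 1 and the estimate collapses to the plain average of all readings; small h uses only nearby sensors, so the bandwidth plays the same smoothing role as Δ and K did above.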
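For completeness, the k-NN density estimate described earlier (fix nx = k, estimate Δ from the data) can also be sketched. This is a 1-D illustration under assumptions made here: the function name is invented, and the "ball" around x is the interval of half-width Δ, so its volume is 2Δ.

```python
import numpy as np

def knn_density(x, data, k=10):
    """k-NN density estimate in 1-D: let Delta be the distance to the
    k-th nearest training point, so the interval [x-Delta, x+Delta]
    (length 2*Delta) contains k points, and set
    p_hat(x) = k / (n * 2 * Delta)."""
    dists = np.sort(np.abs(data - x))
    delta = dists[k - 1]             # radius of the ball holding k points
    return k / (len(data) * 2 * delta)

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=2000)
print(knn_density(0.0, sample, k=50))   # roughly the N(0,1) density at 0
```

Computing the k-th nearest distance for every query point is what makes this expensive, matching the slide's remark that k-NN density estimation is rarely used in practice.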