CMSC 828G Principles of Data Mining, Lecture #21
• Today's Reading:
  – HMS, 6.3, 7.3.1, 10
• Today's Lecture:
  – Predictive Modeling
    • bias and variance in KNN
    • SVMs
    • evaluating predictive models
    • feature selection
  – Finding Patterns
• Upcoming Due Dates:
  – H3 due Thu

Nearest Neighbor Methods
• To classify a new input vector y, examine the k closest training data points to y and assign the object to the most frequently occurring class.
• (Figure: neighborhoods of a query point y for k = 1 and k = 6.)

Issues
• Distance measure
  – most common: Euclidean
• Choosing k
  – increasing k reduces variance but increases bias
• In high-dimensional spaces, the nearest neighbor may not be very close at all!
• Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large datasets.
• Indexing the data can help; for example, KD-trees.

Bias and Variance in NN vs. Linear Regression
• NN makes no assumptions about the model (low bias), but its prediction is not stable (high variance):
  $\hat{y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$
• LR makes strong assumptions about the model (high bias), but its prediction is stable (low variance):
  $\hat{y} = x^T \hat{\beta}$

SVM Resources
• www.support-vector.net
• tutorials:
  – http://www.support-vector.net/icml-tutorial.pdf
  – C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
  – A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
• books:
  – Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
  – Schölkopf, B., Burges, C. J. C., Smola, A. J., Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.

Support Vector Machines (SVMs)
• currently the 'en vogue' approach to classification and regression
• successful applications in a wide variety of fields: bioinformatics, text, handwriting recognition, image processing
• draws people from machine learning, optimization, statistics, neural networks, functional analysis
• introduced by Boser, Guyon and Vapnik, 1992
• Kernel Machines: SVMs are a particular instance

Basic Idea
• For a given learning task with a given finite amount of training data, the best generalization performance comes from balancing the accuracy on that particular training set against the "capacity" of the machine (its ability to learn any training set without error).
• Botanist example (Burges):
  – Too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything seen before.
  – Too little capacity: if it's green, it's a tree.
  – Neither generalizes well!

Linear SVMs
• Simplest case: classification using a hyperplane in input space
• The Perceptron Algorithm (Rosenblatt, 1957)
• Notation:
  – input space: x ∈ X
  – output space: y ∈ Y = {-1, +1}
  – hypothesis: h ∈ H
  – function: f: X → R
  – training set: S = {(x_1, y_1), …, (x_m, y_m)}
  – test error: ε
  – dot product: 〈x, z〉

Perceptron
• Linear separation of the input space (figure: o's and x's separated by a hyperplane with normal w and offset b):
  $f(x) = \langle w, x \rangle + b, \qquad h(x) = \mathrm{sign}(f(x))$
• Update rule: if $y_i (\langle w_k, x_i \rangle + b) \le 0$ then
  $w_{k+1} = w_k + \eta\, y_i x_i, \qquad k \leftarrow k + 1$

Observation
• The solution is a linear combination of the training points:
  $w = \sum_i \alpha_i y_i x_i, \qquad \alpha_i \ge 0$
• Only uses informative points (mistake driven).
• Can be represented in dual form:
  $f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$

Linear Support Vector Machines
• d+ is the distance to the closest positive example; d− is the distance to the closest negative example.
• Define the margin of a separating hyperplane to be d+ + d−.
• SVMs look for the hyperplane with the largest margin (figure: a separating hyperplane with d+, d−, and the support vectors marked).
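A minimal NumPy sketch may help make the perceptron update rule above concrete. The learning rate eta, the epoch cap, the explicit bias update (b ← b + η y_i, a common companion to the slide's w update), and the toy data are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Train h(x) = sign(<w, x> + b) with the mistake-driven rule from the
    slides: when y_i * (<w, x_i> + b) <= 0, set w <- w + eta * y_i * x_i."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w = w + eta * yi * xi           # update from the slide
                b = b + eta * yi                # bias update (assumed companion rule)
                mistakes += 1
        if mistakes == 0:                       # data separated: stop early
            break
    return w, b

# Tiny linearly separable toy example, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # reproduces y
```

Since the final w is accumulated from η y_i x_i terms on the mistakes, it can be rewritten as w = Σ_i α_i y_i x_i, which is exactly the dual form noted on the Observation slide.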
Constrained Optimization
• The separating hyperplane must satisfy
  $x_i \cdot w + b \ge +1$ for $y_i = +1$   (1)
  $x_i \cdot w + b \le -1$ for $y_i = -1$   (2)
  which combine into $y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$.
• Points for which equality in (1) holds lie on the hyperplane $H_1: x_i \cdot w + b = 1$, with normal w and perpendicular distance $|1 - b| / \|w\|$ from the origin.
• Points for which equality in (2) holds lie on the hyperplane $H_2: x_i \cdot w + b = -1$, with normal w and perpendicular distance $|-1 - b| / \|w\|$ from the origin.
• Hence $d_+ = d_- = 1 / \|w\|$ and the margin is $2 / \|w\|$; we can maximize the margin by minimizing $\|w\|^2$ subject to the constraints.

Lagrangian
• Introduce Lagrange multipliers $\alpha_i$:
  $L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$
• Minimize $L_P$ with respect to w and b, with $\alpha_i \ge 0$: a convex programming problem.
• Can equivalently solve the dual problem: maximize
  $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
  with respect to the $\alpha_i$, subject to the constraints obtained from w, b and $\alpha_i \ge 0$.

Limitations of LSVMs
• Linear classifiers cannot deal with
  – nonlinearly separable data
  – noisy data
• One solution: a network of simple linear classifiers, i.e. a Neural Network.
• Other solution: map the data into a richer feature space which includes non-linear features, then use a linear classifier.

SVMs cont.
• Map the data into a feature space where they are linearly separable: $x \to \phi(x)$
  (figure: o's and x's in input space mapped to φ(o)'s and φ(x)'s in feature space).

SVMs cont.
• In the dual representation, the data points only appear inside dot products:
  $f(x) = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$
• Introduce kernels:
  $K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$
  – a function that returns the value of the dot product between the images of its two arguments
  – solves the computational problem of working in a high-dimensional space

SVMs summary
• Properties: duality, kernels, margin, convexity.
• Based on statistical learning theory; connections to VC-dimension.
• A very general and rich class of pattern recognition methods.

Evaluating and Comparing Classifiers
• Goal: estimating the true future error rate.
• Reclassifying the training data and measuring the proportion misclassified gives the apparent or resubstitution error rate. This typically underestimates the future error rate and is ill-advised.
• Estimate the future error by calculating the error on a new sample, the test set.
• If the data set is small, use cross-validation: leave out some proportion of the data and evaluate on the left-out portion.
  – leave-one-out

Cross-validation
• If the data set is small, use cross-validation: leave out some proportion of the data when the classifier is constructed and evaluate on the left-out portion. Repeat, with different parts of the data being omitted.
  – leave-one-out: leave out one point at a time
  – bootstrap methods:
    • introduced by Efron, 1979, to calculate confidence intervals for parameters when the sample size is small
    • at a high level: observations are randomly re-assigned and estimates are recomputed; these are treated as repeated experiments

Feature Selection
• classification in high dimensions -- too many
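Referring back to the kernel slides above, the following sketch checks that a trained kernel SVM's decision function really has the dual form f(x) = Σ_i α_i y_i K(x_i, x) + b, with the sum running only over the support vectors. The use of scikit-learn's SVC, the RBF kernel and its parameters, and the two-ring toy data are assumptions for illustration, not anything prescribed in the lecture:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two noisy rings, not linearly separable in input space.
rng = np.random.default_rng(0)
n = 200
radius = np.where(rng.random(n) < 0.5, 1.0, 3.0)
angle = rng.uniform(0, 2 * np.pi, n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)] + 0.1 * rng.normal(size=(n, 2))
y = np.where(radius < 2.0, -1, 1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Rebuild f(x) = sum_i alpha_i * y_i * K(x_i, x) + b by hand, summing only
# over the support vectors, and compare with the library's decision function.
def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.2, -0.1])
K = rbf(clf.support_vectors_, x_new)                  # K(x_i, x_new) per support vector
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]  # dual_coef_ stores alpha_i * y_i
print(np.isclose(f_manual, clf.decision_function([x_new])[0]))  # True
print("number of support vectors:", len(clf.support_vectors_))
```

The point of the check is the one made on the slides: after training, prediction never needs the feature map φ explicitly, only kernel evaluations against the support vectors.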

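The contrast between the apparent (resubstitution) error rate and a leave-one-out estimate can be seen in a few lines. This sketch assumes scikit-learn, a 3-nearest-neighbor classifier, and a small synthetic data set, none of which come from the lecture:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set so that leave-one-out stays cheap.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)

# Apparent (resubstitution) error: train and test on the same data.
apparent_error = 1 - knn.fit(X, y).score(X, y)

# Leave-one-out cross-validation: hold out one point at a time,
# train on the rest, and average the held-out errors.
loo_scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
loo_error = 1 - loo_scores.mean()

print(f"apparent error: {apparent_error:.3f}")   # typically optimistic
print(f"LOO estimate:   {loo_error:.3f}")        # closer to the future error rate
```

Replacing LeaveOneOut() with, say, KFold(n_splits=10) gives the usual k-fold variant in which larger portions of the data are omitted in turn.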
