CMSC 828G Principles of Data Mining, Lecture #21
• Today's Reading:
  – HMS, 6.3, 7.3.1, 10
• Today's Lecture:
  – Predictive Modeling
    • bias and variance in KNN
    • SVMs
    • evaluating predictive models
    • feature selection
  – Finding Patterns
• Upcoming Due Dates:
  – H3 due Thu

Nearest Neighbor Methods
• To classify a new input vector y, examine the k closest training data points to y and assign the object to the most frequently occurring class.
• (Figure: neighborhoods of a query point y for k = 1 and k = 6.)

Issues
• Distance measure
  – most common: Euclidean
• Choosing k
  – increasing k reduces variance but increases bias
• In high-dimensional spaces, the nearest neighbor may not be very close at all!
• Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large datasets.
• Indexing the data can help; for example, KD-trees.

Bias and Variance in NN vs. Linear Regression
• NN makes no assumptions about the model (low bias), but its prediction is not stable (high variance):
  $\hat{y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$
• LR makes strong assumptions about the model (high bias), but its prediction is stable (low variance):
  $\hat{y} = x^T \hat{\beta}$

SVM Resources
• www.support-vector.net
• tutorials:
  – http://www.support-vector.net/icml-tutorial.pdf
  – C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
  – A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
• books:
  – Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
  – Schölkopf, B., Burges, C. J. C., Smola, A. J., Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.

Support Vector Machines (SVMs)
• currently the 'en vogue' approach to classification and regression
• successful applications in a wide variety of fields: bioinformatics, text, handwriting recognition, image processing
• draws people from machine learning, optimization, statistics, neural networks, functional analysis
• introduced by Boser, Guyon and Vapnik, 1992
• Kernel Machines: SVMs are a particular instance

Basic Idea
• For a given learning task with a given finite amount of training data, the best generalization performance comes from balancing the accuracy on that particular training set against the "capacity" of the machine (its ability to learn any training set without error).
• Botanist example (Burges):
  – Too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything seen before.
  – Too little capacity: if it's green, it's a tree.
  – Neither generalizes well!

Linear SVMs
• Simplest case: classification using a hyperplane in input space
• The Perceptron Algorithm (Rosenblatt, 1957)
• Notation:
  – input space: x ∈ X
  – output space: y ∈ Y = {-1, +1}
  – hypothesis: h ∈ H
  – function: f: X → R
  – training set: S = {(x_1, y_1), …, (x_m, y_m)}
  – test error: ε
  – dot product: 〈x, z〉

Perceptron
• Linear separation of the input space (figure: o's and x's separated by a hyperplane with normal w and offset b):
  $f(x) = \langle w, x \rangle + b, \qquad h(x) = \mathrm{sign}(f(x))$
• Update rule: if $y_i (\langle w_k, x_i \rangle + b) \le 0$ then
  $w_{k+1} = w_k + \eta\, y_i x_i, \qquad k \leftarrow k + 1$

Observation
• The solution is a linear combination of the training points:
  $w = \sum_i \alpha_i y_i x_i, \qquad \alpha_i \ge 0$
• Only uses informative points (mistake driven).
• Can be represented in dual form:
  $f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$

Linear Support Vector Machines
• d+ is the distance to the closest positive example; d− is the distance to the closest negative example.
• Define the margin of a separating hyperplane to be d+ + d−.
• SVMs look for the hyperplane with the largest margin (figure: a separating hyperplane with d+, d−, and the support vectors marked).
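A minimal NumPy sketch may help make the perceptron update rule above concrete. The learning rate eta, the epoch cap, the explicit bias update (b ← b + η y_i, a common companion to the slide's w update), and the toy data are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Train h(x) = sign(<w, x> + b) with the mistake-driven rule from the
    slides: when y_i * (<w, x_i> + b) <= 0, set w <- w + eta * y_i * x_i."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w = w + eta * yi * xi           # update from the slide
                b = b + eta * yi                # bias update (assumed companion rule)
                mistakes += 1
        if mistakes == 0:                       # data separated: stop early
            break
    return w, b

# Tiny linearly separable toy example, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # reproduces y
```

Since the final w is accumulated from η y_i x_i terms on the mistakes, it can be rewritten as w = Σ_i α_i y_i x_i, which is exactly the dual form noted on the Observation slide.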
Constrained Optimization
• The separating hyperplane must satisfy
  $x_i \cdot w + b \ge +1$ for $y_i = +1$   (1)
  $x_i \cdot w + b \le -1$ for $y_i = -1$   (2)
  which combine into $y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$.
• Points for which equality in (1) holds lie on the hyperplane $H_1: x_i \cdot w + b = 1$, with normal w and perpendicular distance $|1 - b| / \|w\|$ from the origin.
• Points for which equality in (2) holds lie on the hyperplane $H_2: x_i \cdot w + b = -1$, with normal w and perpendicular distance $|-1 - b| / \|w\|$ from the origin.
• Hence $d_+ = d_- = 1 / \|w\|$ and the margin is $2 / \|w\|$; we can maximize the margin by minimizing $\|w\|^2$ subject to the constraints.

Lagrangian
• Introduce Lagrange multipliers $\alpha_i$:
  $L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$
• Minimize $L_P$ with respect to w and b, with $\alpha_i \ge 0$: a convex programming problem.
• Can equivalently solve the dual problem: maximize
  $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
  with respect to the $\alpha_i$, subject to the constraints obtained from w, b and $\alpha_i \ge 0$.

Limitations of LSVMs
• Linear classifiers cannot deal with
  – nonlinearly separable data
  – noisy data
• One solution: a network of simple linear classifiers, i.e. a Neural Network.
• Other solution: map the data into a richer feature space which includes non-linear features, then use a linear classifier.

SVMs cont.
• Map the data into a feature space where they are linearly separable: $x \to \phi(x)$
  (figure: o's and x's in input space mapped to φ(o)'s and φ(x)'s in feature space).

SVMs cont.
• In the dual representation, the data points only appear inside dot products:
  $f(x) = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$
• Introduce kernels:
  $K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$
  – a function that returns the value of the dot product between the images of its two arguments
  – solves the computational problem of working in a high-dimensional space

SVMs summary
• Properties: duality, kernels, margin, convexity.
• Based on statistical learning theory; connections to VC-dimension.
• A very general and rich class of pattern recognition methods.

Evaluating and Comparing Classifiers
• Goal: estimating the true future error rate.
• Reclassifying the training data and measuring the proportion misclassified gives the apparent or resubstitution error rate. This typically underestimates the future error rate and is ill-advised.
• Estimate the future error by calculating the error on a new sample, the test set.
• If the data set is small, use cross-validation: leave out some proportion of the data and evaluate on the left-out portion.
  – leave-one-out

Cross-validation
• If the data set is small, use cross-validation: leave out some proportion of the data when the classifier is constructed and evaluate on the left-out portion. Repeat, with different parts of the data being omitted.
  – leave-one-out: leave out one point at a time
  – bootstrap methods:
    • introduced by Efron, 1979, to calculate confidence intervals for parameters when the sample size is small
    • at a high level: observations are randomly re-assigned and estimates are recomputed; these are treated as repeated experiments

Feature Selection
• classification in high dimensions -- too many
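Referring back to the kernel slides above, the following sketch checks that a trained kernel SVM's decision function really has the dual form f(x) = Σ_i α_i y_i K(x_i, x) + b, with the sum running only over the support vectors. The use of scikit-learn's SVC, the RBF kernel and its parameters, and the two-ring toy data are assumptions for illustration, not anything prescribed in the lecture:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two noisy rings, not linearly separable in input space.
rng = np.random.default_rng(0)
n = 200
radius = np.where(rng.random(n) < 0.5, 1.0, 3.0)
angle = rng.uniform(0, 2 * np.pi, n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)] + 0.1 * rng.normal(size=(n, 2))
y = np.where(radius < 2.0, -1, 1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Rebuild f(x) = sum_i alpha_i * y_i * K(x_i, x) + b by hand, summing only
# over the support vectors, and compare with the library's decision function.
def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.2, -0.1])
K = rbf(clf.support_vectors_, x_new)                  # K(x_i, x_new) per support vector
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]  # dual_coef_ stores alpha_i * y_i
print(np.isclose(f_manual, clf.decision_function([x_new])[0]))  # True
print("number of support vectors:", len(clf.support_vectors_))
```

The point of the check is the one made on the slides: after training, prediction never needs the feature map φ explicitly, only kernel evaluations against the support vectors.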

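The contrast between the apparent (resubstitution) error rate and a leave-one-out estimate can be seen in a few lines. This sketch assumes scikit-learn, a 3-nearest-neighbor classifier, and a small synthetic data set, none of which come from the lecture:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set so that leave-one-out stays cheap.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)

# Apparent (resubstitution) error: train and test on the same data.
apparent_error = 1 - knn.fit(X, y).score(X, y)

# Leave-one-out cross-validation: hold out one point at a time,
# train on the rest, and average the held-out errors.
loo_scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
loo_error = 1 - loo_scores.mean()

print(f"apparent error: {apparent_error:.3f}")   # typically optimistic
print(f"LOO estimate:   {loo_error:.3f}")        # closer to the future error rate
```

Replacing LeaveOneOut() with, say, KFold(n_splits=10) gives the usual k-fold variant in which larger portions of the data are omitted in turn.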
