
Slide outline: Welcome to This Course; What Can Data Mining and Analysis Do?; Terminology: Some (Near-)Synonyms; Today; What This Course is; Goals of This Course; Background Needed; Taking this Course; Book; Grading; How Hard Will This Course Be?; A Dataset; Main Types of Columns; Main Goal of Learning: Prediction; 5 Main Learning Tasks; Classification; Classification; Example: Cancer; Example: Netflix; Example: Zipcodes; Example: Google; Example: Call Centers; Example: Stock Market; Growth of Machine Learning; Main Things You Should Know; Now to Hear From You...

CS 4245 Lecture 1
Overview of Data Mining and Analysis I
Alexander Gray
[email protected]
Georgia Institute of Technology
CS 4245 Lecture 1 – p. 1/29

Welcome to This Course
- CS 4245 (89658); cross-listed as ISYE 4245
- Introduction to Data Mining and Analysis
- TuTh 4:35pm-5:55pm, Klaus
- ~agray/4245fall10
- Instructor: Prof. Alexander Gray
- TA: Nishant Mehta, [email protected]
- TA office hours: Monday 3-5pm, 1305 Klaus
- My office hours: Grab me right after a lecture

What Can Data Mining and Analysis Do?
Examples:
- Google: Targeted advertising
- Supermarkets: Promotion planning
- Call centers: Speech recognition
- Scanners: Optical character recognition
- Post office: Zipcode handwriting recognition
- Credit cards: Loan default prediction
- Stock market: Statistical arbitrage
- Drug design: Drug candidate screening
- Large Hadron Collider: Particle screening

Terminology: Some (Near-)Synonyms
- Machine learning
- Data mining
- Pattern recognition
- Computational statistics
- Advanced or predictive analytics (used in business)
From now on I will just refer to “machine learning” (ML).
Some bigger concepts that ML is part of:
- Statistics (e.g. includes hypothesis testing)
- Data analysis (e.g. includes visualization)
- Artificial intelligence (e.g. includes planning)
- Applied mathematics, computational science and engineering (CSE) (e.g. includes optimization)

Today
1. What is this course?
2. What is data mining and analysis?
I’ll stick around for questions.

What is this course?
All the logistical information needed for you to answer the question “Should I take this course?”

What This Course is
Computational techniques for analysis of large, complex datasets, covering:
- fundamental aspects (i.e. mathematical foundations)
- modern data mining and analysis techniques (…)

Goals of This Course
- Give you the basics needed for:
  - Competent analysis of data (application of ML), using common techniques
  - Taking more advanced data analysis courses (in CS, ISYE, Math, etc.)
- Give you a glimpse of the big picture of this field, including context for other courses

Background Needed
- Multivariable calculus (Math 2401 or Math 2411 or Math 2605)
- Basics of programming (CS 1332 or CS 1372)
- I will teach you basic probability and statistics!
- Linear algebra helps but is not needed for this course
- Ability to write programs and make plots (use any language/system you like)

Taking this Course
- Yes, you should take this class, if you have the background
- Yes, you can get into the class if you have the background; if you can’t register online, email me for a permit

Book
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Hastie, Tibshirani, and Friedman, 2nd edition (free online; see the course webpage)
- My slides: if it is not mentioned in my slides, it is not an official topic of the course
Optional:
- All of Statistics, Wasserman
- My slides will cover the material of Wasserman that is relevant to this course
I recommend buying physical books (from one of the campus bookstores or online), as they will be a great long-term resource for you.
Grading
(tentative)
- 60% assignments: about every 2 weeks, involving some conceptual/mathematical questions (from the textbook), and experimental questions (running programs on data)
- 20% midterm: on roughly first half of class, short-answer and multiple-choice
- 20% final: on roughly second half of class, short-answer and multiple-choice

How Hard Will This Course Be?
- Mathematical, but no proofs to be written; only short derivations
- Lots of computer experiments, some programming
- Average pace

What is data mining and analysis?
An introduction to the topic intended to answer the question “Why is this cool?”

A Dataset
- Data/points/instances/examples/samples/records: rows
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: columns
- Target/outcome/response/label/dependent variable: special column to be predicted

Main Types of Columns
- Continuous: a number, like an age or height
- Discrete: a symbol, like “cat” or “dog”

Main Goal of Learning: Prediction
The setup:
1. You obtain some kind of model based on some examples, or training data, through a process called learning (also estimation).
2. Then you use that model to predict something about data you haven’t seen before, but that comes from the same distribution as the training data, called test data.

5 Main Learning Tasks
1. Classification: predict a discrete target variable
2. Regression: predict a continuous target variable
3. Density estimation: predict the distribution
4. Clustering: predict clusters
5. Dimensionality reduction: predict new features
- Supervised learning: We’re predicting a target variable for which we get to see examples. (regression, classification)
- Unsupervised learning: We’re predicting a target variable for which we never get to see examples. (density estimation, clustering, dimensionality reduction)
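
To make the two-step setup (learn a model from training data, then predict on test data) concrete, here is a minimal sketch in Python. The toy dataset, the column meanings, and the helper names (`split_features_target`, `learn_majority_model`, `predict`) are all hypothetical and not from the slides. The "model" is a deliberately trivial baseline that always predicts the most common training label, chosen only to show the learn/predict workflow on a dataset with one continuous feature, one discrete feature, and a discrete target.

```python
from collections import Counter

# Hypothetical toy dataset: each row is one example (point/instance/record).
# Columns: age (continuous feature), pet (discrete feature), label (discrete target).
rows = [
    (25.0, "cat", "yes"),
    (40.0, "dog", "no"),
    (31.0, "cat", "yes"),
    (58.0, "dog", "yes"),
    (22.0, "cat", "no"),
    (47.0, "dog", "yes"),
]


def split_features_target(data):
    """Separate the feature columns from the target column (the last one)."""
    features = [row[:-1] for row in data]
    targets = [row[-1] for row in data]
    return features, targets


def learn_majority_model(train_targets):
    """Learning step: estimate the most common target value in the training data."""
    return Counter(train_targets).most_common(1)[0][0]


def predict(model, test_features):
    """Prediction step: this baseline ignores the features and repeats one label."""
    return [model for _ in test_features]


# Assumption for illustration: the first four rows are training data, the rest test data.
train_rows, test_rows = rows[:4], rows[4:]
train_X, train_y = split_features_target(train_rows)
test_X, test_y = split_features_target(test_rows)

model = learn_majority_model(train_y)
predictions = predict(model, test_X)

# Fraction of test examples whose predicted label matches the true label.
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(model, predictions, accuracy)
```

Because the target here is discrete, this is a classification task; replacing the label column with a number and the majority vote with, say, the training mean would turn the same workflow into a regression sketch.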
Classification
We’re predicting a discrete target variable. Supervised.
Simple example of a classification model (“classifier”): Use the label of the past training point which is most similar to the new test point, and return that as the prediction (“nearest-neighbor”).

Example: Cancer
- Application: automatic disease detection
- Importance: this is modern/future medical diagnosis.
- Prediction goal: Based on past patients, predict whether you have the disease
- Data: Past patients with and without the disease
- Target: Cancer or no-cancer
- Features: Concentrations of various proteins in
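
The nearest-neighbor classifier described on the Classification slide can be sketched in a few lines of Python. This is an illustrative sketch, not the course's implementation: the slides do not say how similarity is measured, so Euclidean distance is an assumption here, and the two-feature "protein concentration" values and the function names are made up for the example.

```python
import math


def euclidean_distance(a, b):
    """Distance between two feature vectors (assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_predict(train_features, train_labels, test_point):
    """Return the label of the training point closest to test_point."""
    nearest_index = min(
        range(len(train_features)),
        key=lambda i: euclidean_distance(train_features[i], test_point),
    )
    return train_labels[nearest_index]


# Hypothetical training data: two continuous features per past patient.
train_features = [(1.0, 0.2), (0.9, 0.4), (3.1, 2.8), (2.7, 3.3)]
train_labels = ["no-cancer", "no-cancer", "cancer", "cancer"]

# Predict the discrete target for two new (test) points.
prediction_a = nearest_neighbor_predict(train_features, train_labels, (1.2, 0.3))
prediction_b = nearest_neighbor_predict(train_features, train_labels, (2.9, 3.0))
print(prediction_a, prediction_b)
```

Note that "learning" in this sketch amounts to storing the training data; all the work happens at prediction time, when the test point is compared against every stored example.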
