GSU CSC 2010 - Chapter 6

Chapter 6: Classification and Prediction
Dr. Bernard Chen, Ph.D.
University of Central Arkansas, Fall 2010

Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. For example:
- Are bank loan applicants "safe" or "risky"?
- Will a customer buy a new computer?
- Analyzing cancer data to predict which one of three specific treatments should be applied

Classification

Classification is a two-step process:
- Learning step: constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses it in classifying new data
- Prediction step: predicts categorical class labels (discrete or nominal)

Learning Step: Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns a classifier (model) from the training data, for example the rule:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Learning step:
- Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Prediction Step: Using the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?

Prediction step:
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model
- The accuracy rate is the percentage of test-set samples that are correctly classified by the model
- The test set must be independent of the training set; otherwise over-fitting will occur

N-fold Cross-validation

To address the over-fitting problem, n-fold cross-validation is usually used. For example, 7-fold cross-validation works as follows:
1. Divide the whole training dataset into 7 equal parts
2. Take the first part away and train the model on the remaining 6 portions
3. After the model is trained, feed the held-out first part in as the testing dataset and obtain the accuracy
4. Repeat steps 2 and 3, taking the second part away, and so on

Supervised Learning vs. Unsupervised Learning

- Because the class label of each training tuple is provided, this step is also known as supervised learning
- It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is unknown

Issues: Data Preparation

- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues: Evaluating Classification Methods

- Accuracy
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability

Decision Tree

- Decision tree induction is the learning of decision trees from class-labeled training tuples
- A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute
- Each branch represents an outcome of the test
- Each leaf node holds a class label

Decision Tree Algorithm

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Attribute Selection Measure: Information Gain (ID3/C4.5)

Training data:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

The dataset D has 9 "yes" and 5 "no" tuples, so the expected information is

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Partitioning D on age gives

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here the term (5/14) I(2,3) means "age <= 30" has 5 out of the 14 samples, with 2 yes's and 3 no's, and

I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5)

To evaluate this in http://web2.0calc.com/, type: -2/5*log2(2/5)-3/5*log2(3/5)

The information gain of age is therefore

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246

Similarly, we can compute:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Since "age" obtains the highest information gain (0.246), it is chosen as the splitting attribute and the tree is partitioned on age. The other gains follow the same pattern; for income, for example,

Info_income(D) = (4/14) I(2,2) + (6/14) I(4,2) + (4/14) I(3,1) = 0.911

so Gain(income) = 0.940 - 0.911 = 0.029.

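The 7-fold procedure described in the cross-validation slide generalizes to any n. A minimal sketch, assuming caller-supplied `train` and `evaluate` functions (hypothetical placeholders for any classifier, not part of the slides):

```python
import random

def n_fold_cross_validation(data, n, train, evaluate):
    """Estimate accuracy by n-fold cross-validation.

    `train(rows) -> model` and `evaluate(model, rows) -> accuracy` are
    caller-supplied placeholders. Assumes len(data) >= n so no fold is empty.
    """
    rows = list(data)
    random.shuffle(rows)                    # avoid any ordering bias
    folds = [rows[i::n] for i in range(n)]  # n roughly equal parts
    accuracies = []
    for i in range(n):
        test_fold = folds[i]                                        # hold one part out
        train_rows = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train(train_rows)           # train on the other n-1 parts
        accuracies.append(evaluate(model, test_fold))
    return sum(accuracies) / n              # mean accuracy over the n folds
```

Because every tuple serves as test data exactly once, the averaged accuracy is a less optimistic estimate than testing on the training set itself, which is the over-fitting problem the slides describe.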

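The greedy, top-down, divide-and-conquer algorithm from the decision-tree slides can itself be sketched in a few lines. This is an ID3-style sketch assuming categorical attributes (as the slides do); attributes are positions indexing into each tuple, and the class label is the last element:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attrs):
    """Pick the attribute with the highest information gain (ID3 heuristic)."""
    base = entropy([r[-1] for r in rows])
    def gain(a):
        split = 0.0
        for v in set(r[a] for r in rows):
            sub = [r[-1] for r in rows if r[a] == v]
            split += len(sub) / len(rows) * entropy(sub)
        return base - split
    return max(attrs, key=gain)

def id3(rows, attrs):
    """Top-down, recursive, divide-and-conquer tree construction."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:              # pure node -> leaf with that class
        return labels[0]
    if not attrs:                          # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, attrs)        # greedy choice at this node
    rest = [x for x in attrs if x != a]
    return {a: {v: id3([r for r in rows if r[a] == v], rest)
                for v in set(r[a] for r in rows)}}
```

The returned tree is a nested dict: internal nodes map an attribute index to its outcome branches, and leaves are class labels, mirroring the flowchart structure the slides describe.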