Chapter 6: Classification and Prediction
Dr. Bernard Chen, Ph.D.
University of Central Arkansas
Fall 2010

Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. For example:
- Are bank loan applicants "safe" or "risky"?
- Will a customer buy a new computer?
- Analyze cancer data to predict which of three specific treatments should be applied

Classification

Classification is a two-step process:
- Learning step: construct a model from the training set, based on the values (class labels) of a classifying attribute, and use it to classify new data
- Prediction step: predict categorical class labels (discrete or nominal) for unseen data

Learning step: Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm turns the training data into a classifier (model), here a rule:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Learning step

- Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Prediction step: Using the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?

Prediction step

- Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification (a small sketch applying the tenure rule to this testing data appears after the cross-validation section below)
- The accuracy rate is the percentage of test-set samples that are correctly classified by the model
- The test set must be independent of the training set; otherwise over-fitting will occur

N-fold Cross-validation

To address the over-fitting problem, n-fold cross-validation is usually used. For example, 7-fold cross-validation:
1. Divide the whole training dataset into 7 equal parts
2. Hold out the first part and train the model on the remaining 6 portions
3. After the model is trained, feed the held-out part in as the testing dataset and obtain the accuracy
4. Repeat steps 2 and 3, holding out the second part, and so on
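Returning to the tenure example: the rule learned above can be applied to the testing data to estimate accuracy. Below is a minimal Python sketch; the rule and the tuples come from the slides, but the function names and the code itself are illustrative.

```python
# Apply the learned rule to the testing data and measure accuracy.

def classify(rank, years):
    """The classifier learned above: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, known tenured label)
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(1 for _, rank, years, label in test_set
              if classify(rank, years) == label)
print(f"accuracy = {correct}/{len(test_set)}")  # 3/4: Merlisa (7 years, not tenured) is misclassified

# Unseen data: (Jeff, Professor, 4)
print(classify("Professor", 4))  # "yes" -> Jeff is predicted tenured
```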
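Likewise, here is a minimal sketch of the n-fold procedure just described. The "model" is a trivial majority-class classifier standing in for any real learner; all names are illustrative, not from the slides.

```python
# Minimal n-fold cross-validation sketch (illustrative names, plain Python).

def train_majority(rows):
    """'Train' by remembering the most common class label in the training rows."""
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def accuracy(model_label, rows):
    """Fraction of test rows whose known label matches the model's prediction."""
    return sum(1 for _, label in rows if label == model_label) / len(rows)

def cross_validate(rows, n=7):
    """Split the data into n parts; each part serves once as the test set.
    In practice the rows are shuffled first so the folds are representative."""
    fold_size = len(rows) // n
    scores = []
    for i in range(n):
        test = rows[i * fold_size:(i + 1) * fold_size]
        train = rows[:i * fold_size] + rows[(i + 1) * fold_size:]
        model = train_majority(train)
        scores.append(accuracy(model, test))
    return sum(scores) / n  # average accuracy over the n folds

# Example: 14 (features, label) pairs with 9 "yes" and 5 "no" labels.
data = [(None, "yes")] * 9 + [(None, "no")] * 5
print(cross_validate(data, n=7))
```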
Supervised learning vs. Unsupervised learning

- Because the class label of each training tuple is provided, this step is also known as supervised learning
- It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is unknown

Issues: Data Preparation

- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues: Evaluating Classification Methods

- Accuracy
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability

Decision Tree

- Decision tree induction is the learning of decision trees from class-labeled training tuples
- A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute
- Each branch represents an outcome of the test
- Each leaf node holds a class label

Decision Tree Example

[Figure: an example decision tree; internal nodes test attributes, branches carry test outcomes, and leaves give class labels]

Decision Tree Algorithm

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Training data (buys_computer):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Attribute Selection Measure: Information Gain (ID3/C4.5)

The data set D has 9 "yes" and 5 "no" tuples, so the expected information needed to classify a tuple is

Info(D) = I(9,5) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) = 0.940

Partitioning D on age gives

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's:

I(2,3) = -(2/5) log_2(2/5) - (3/5) log_2(3/5) = 0.971

(To evaluate this in http://web2.0calc.com/, type: -2/5*log2(2/5)-3/5*log2(3/5))

So Info_age(D) = 5/14 * (0.971) + 4/14 * (0) + 5/14 * (0.971) = 0.694, and

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246

Similarly, we can compute:
- Gain(income) = 0.029
- Gain(student) = 0.151
- Gain(credit_rating) = 0.048

Since age obtains the highest information gain, we choose it as the splitting attribute and partition the tree on age.

For income, for instance: Info_income(D) = (4/14) I(2,2) + (6/14) I(4,2) + (4/14) I(3,1) = 0.911, so Gain(income) = 0.940 - 0.911 = 0.029.
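These entropy and gain values can be verified with a short Python sketch. The dataset is the 14-tuple table above; the function names (entropy, info_after_split, gain) are illustrative, not from the slides.

```python
from math import log2

# The 14 buys_computer training tuples from the table above:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31..40","high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31..40","low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31..40","medium", "no",  "excellent", "yes"),
    ("31..40","high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def entropy(rows):
    """Info(D): expected information needed to classify a tuple in rows."""
    labels = [r[-1] for r in rows]
    return -sum((labels.count(c) / len(rows)) * log2(labels.count(c) / len(rows))
                for c in set(labels))

def info_after_split(rows, attr):
    """Info_attr(D): weighted entropy after partitioning rows on attr."""
    i = ATTRS[attr]
    total = 0.0
    for v in set(r[i] for r in rows):
        part = [r for r in rows if r[i] == v]
        total += len(part) / len(rows) * entropy(part)
    return total

def gain(rows, attr):
    """Gain(attr) = Info(D) - Info_attr(D)."""
    return entropy(rows) - info_after_split(rows, attr)

print(round(entropy(data), 3))                # 0.94
print(round(gain(data, "age"), 3))            # 0.246
print(round(gain(data, "income"), 3))         # 0.029
print(round(gain(data, "student"), 3))        # 0.151
print(round(gain(data, "credit_rating"), 3))  # 0.048
```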
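Continuing the sketch above, the top-down recursive induction described on the algorithm slide can be written compactly. This is an illustrative ID3-style builder (categorical attributes only, no pruning), reusing data, ATTRS, and gain from the previous block:

```python
def majority_label(rows):
    """Most common class label among rows (used when no attributes remain)."""
    labels = [r[-1] for r in rows]
    return max(set(labels), key=labels.count)

def build_tree(rows, attrs):
    """ID3-style induction: greedy, top-down, divide and conquer."""
    labels = set(r[-1] for r in rows)
    if len(labels) == 1:          # pure node -> leaf with that class label
        return labels.pop()
    if not attrs:                 # no attributes left -> majority-class leaf
        return majority_label(rows)
    best = max(attrs, key=lambda a: gain(rows, a))  # highest information gain
    i = ATTRS[best]
    branches = {}
    for v in set(r[i] for r in rows):               # one branch per outcome
        part = [r for r in rows if r[i] == v]
        branches[v] = build_tree(part, [a for a in attrs if a != best])
    return (best, branches)

tree = build_tree(data, list(ATTRS))
print(tree[0])  # "age": the root split, since age has the highest gain (0.246)
```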