Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 8 —

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 8. Classification: Basic Concepts

- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Rule-Based Classification
- Model Evaluation and Selection
- Techniques to Improve Classification Accuracy: Ensemble Methods
- Summary

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc.,
with the aim of establishing the existence of classes or clusters in the data

Prediction Problems: Classification vs. Numeric Prediction

- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Numeric prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis: is a tumor cancerous or benign?
  - Fraud detection: is a transaction fraudulent?
  - Web page categorization: which category does a page belong to?

Classification—A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification result
    - The accuracy rate is the percentage of test-set samples correctly classified by the model
    - The test set is independent of the training set (otherwise overfitting occurs)
  - If the accuracy is acceptable, use the model to classify new data
- Note: if the test set is used to select among models, it is called a validation (test) set

Process (1): Model Construction

Training data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

A classification algorithm learns the classifier (model) from the training data:

  IF rank = 'professor' OR years > 6
  THEN tenured = 'yes'

Process (2): Using the Model in Prediction

Testing data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?

Decision Tree Induction: An Example

- Training data set: Buys_computer
- The data set follows an example from Quinlan's ID3 (playing tennis)
- Resulting tree:

  age?
    <=30   -> student?
                no  -> no
                yes -> yes
    31..40 -> yes
    >40    -> credit rating?
                excellent -> no
                fair      -> yes

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued,
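The two-step process above can be exercised end to end in a few lines. The sketch below (plain Python) applies the classifier learned on the model-construction slide, IF rank = 'professor' OR years > 6 THEN tenured = 'yes', to the held-out test set and computes the accuracy rate; matching rank against the exact string "Professor" is an illustrative assumption.

```python
# Minimal sketch of the model-usage step: apply the learned rule
# (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') to the
# held-out test set and compute the accuracy rate.
# Assumption: rank values are matched as the exact string "Professor".

def classify(rank, years):
    """The classifier (model) from the model-construction slide."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label).
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)   # Merlisa is misclassified -> 3/4 = 0.75

# The unseen tuple (Jeff, Professor, 4) is classified as tenured = "yes".
jeff = classify("Professor", 4)
```

Note that the accuracy rate here (75%) is measured on a test set independent of the training set, exactly as the two-step process prescribes.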
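The greedy, top-down, divide-and-conquer induction just described can be sketched concretely. The following ID3-style implementation uses information gain as the attribute-selection measure (the measure this chapter introduces shortly) and rebuilds a tree from the toy tenured training table; binning YEARS into '<=6' vs '>6' is an illustrative assumption, not something the slides specify.

```python
# A minimal sketch of greedy, top-down decision tree induction (ID3-style,
# information gain) on the toy "tenured" training table from the slides.
# Assumption: YEARS is pre-discretized into '<=6' / '>6' for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy from partitioning the rows on attribute attr."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Top-down recursive divide-and-conquer tree construction."""
    if len(set(labels)) == 1:        # pure node -> leaf
        return labels[0]
    if not attrs:                    # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return (best, branches)

# Training data from the model-construction slide (YEARS binned).
rows = [
    {"rank": "Assistant Prof", "years": "<=6"},  # Mike, 3
    {"rank": "Assistant Prof", "years": ">6"},   # Mary, 7
    {"rank": "Professor",      "years": "<=6"},  # Bill, 2
    {"rank": "Associate Prof", "years": ">6"},   # Jim, 7
    {"rank": "Assistant Prof", "years": "<=6"},  # Dave, 6
    {"rank": "Associate Prof", "years": "<=6"},  # Anne, 3
]
labels = ["no", "yes", "yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["rank", "years"])
```

On this table the greedy search splits on YEARS first (higher information gain), then on RANK within the '<=6' branch, which happens to agree with the rule shown on the model-construction slide.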