Contents

- Data Mining: Classification
- Classification and Prediction
- Classification vs. Prediction
- Classification—A Two-Step Process
- Classification Process (1): Model Construction
- Classification Process (2): Use the Model in Prediction
- Supervised vs. Unsupervised Learning
- Issues (1): Data Preparation
- Issues (2): Evaluating Classification Methods
- Classification by Decision Tree Induction
- Training Dataset
- Output: A Decision Tree for "buys_computer"
- Algorithm for Decision Tree Induction
- Attribute Selection Measure
- Information Gain (ID3/C4.5)
- Information Gain in Decision Tree Induction
- Attribute Selection by Information Gain Computation
- Gini Index (IBM IntelligentMiner)
- Extracting Classification Rules from Trees
- Avoid Overfitting in Classification
- Approaches to Determine the Final Tree Size
- Enhancements to basic decision tree induction
- Classification in Large Databases
- Scalable Decision Tree Induction Methods in Data Mining Studies
- Data Cube-Based Decision-Tree Induction
- Presentation of Classification Results
- Bayesian Classification: Why?
- Bayesian Theorem
- Bayesian classification
- Estimating a-posteriori probabilities
- Naïve Bayesian Classification
- Play-tennis example: estimating P(xi|C)
- Play-tennis example: classifying X
- The independence hypothesis…
- Bayesian Belief Networks (I)
- Bayesian Belief Networks (II)
- Neural Networks
- A Neuron
- Network Training
- Multi-Layer Perceptron
- Association-Based Classification
- Other Classification Methods
- Instance-Based Methods
- The k-Nearest Neighbor Algorithm
- Discussion on the k-NN Algorithm
- Case-Based Reasoning
- Remarks on Lazy vs. Eager Learning
- Genetic Algorithms
- Rough Set Approach
- Fuzzy Sets
- What Is Prediction?
- Predictive Modeling in Databases
- Regression Analysis and Log-Linear Models in Prediction
- Prediction: Numerical Data
- Prediction: Categorical Data
- Classification Accuracy: Estimating Error Rates
- Boosting and Bagging
- Boosting Technique (II) — Algorithm
- Summary
- References (I)
- References (II)

Data Mining: Classification

Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

Classification vs. Prediction
- Classification: predicts categorical class labels; classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis
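The classification/prediction contrast can be sketched in a few lines of Python. Both functions and their data are hypothetical illustrations (not from the slides): a classifier maps a tuple to a categorical label, while a predictor fits a continuous-valued function, here an ordinary least-squares line.

```python
def approve_credit(income, debt):
    """Classification: the output is a categorical label (toy rule)."""
    return 'approved' if income - debt > 20 else 'rejected'

def fit_line(xs, ys):
    """Prediction: fit a continuous-valued function (least-squares line)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx          # slope, intercept

print(approve_credit(50, 10))              # 'approved'
slope, intercept = fit_line([1, 2, 3], [2, 4, 6])
print(slope * 4 + intercept)               # predicts 8.0 for x = 4
```

The point of the contrast: the classifier's output comes from a finite label set, while the fitted line can produce any real value, which is why the slides treat the two as distinct tasks.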
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification
  - The accuracy rate is the percentage of test set samples correctly classified by the model
  - The test set must be independent of the training set; otherwise over-fitting will occur

Classification Process (1): Model Construction
Training data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

A classification algorithm produces the classifier (model):

  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
Testing data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
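The two-step process can be made concrete with the slides' own example. The sketch below (a minimal illustration, not the slides' code) takes the rule produced in step 1 as the classifier, estimates its accuracy rate on the independent test set, and then labels the unseen tuple (Jeff, Professor, 4).

```python
def classify(rank, years):
    """Step 1's output: the rule learned from the training set."""
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

# Independent test set from the slide: (name, rank, years, actual label).
test_set = [
    ('Tom',     'Assistant Prof', 2, 'no'),
    ('Merlisa', 'Associate Prof', 7, 'no'),
    ('George',  'Professor',      5, 'yes'),
    ('Joseph',  'Assistant Prof', 7, 'yes'),
]

# Accuracy rate: percentage of test samples the model labels correctly.
correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(f'accuracy = {accuracy:.0%}')   # 75%: Merlisa (7 years) is misclassified

# Step 2 proper: classify an unseen tuple.
print(classify('Professor', 4))       # Jeff -> 'yes'
```

Note that the 75% estimate is honest only because the test set is independent of the training set, exactly as the slide warns.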
Issues (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability: time to construct the model; time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size; compactness of classification rules

Classification by Decision Tree Induction
- Decision tree: a flow-chart-like tree structure
  - An internal node denotes a test on an attribute
  - A branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes
  - Tree pruning: identify and remove branches that reflect noise or outliers
- Use of decision tree: classifying an unknown sample
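A minimal sketch of the construction phase described above, assuming an ID3-style information-gain selection criterion (covered later in the deck) and assuming the numeric years attribute has been pre-discretized into a years > 6 test; real inducers such as C4.5 choose numeric split points automatically. On the tenure training set this recursive partitioning recovers the rule from the earlier slide.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy from partitioning the rows on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """All examples start at the root; partition recursively on the
    attribute with the highest information gain."""
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {'attr': best, 'branches': {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node['branches'][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return node

def predict(tree, row):
    """Classify an unknown sample by testing its attribute values
    against the tree, root to leaf."""
    while isinstance(tree, dict):
        tree = tree['branches'][row[tree['attr']]]
    return tree

# Training set from the slides; 'years' pre-discretized (assumption).
rows = [
    {'rank': 'Assistant Prof', 'years>6': False},  # Mike, 3
    {'rank': 'Assistant Prof', 'years>6': True},   # Mary, 7
    {'rank': 'Professor',      'years>6': False},  # Bill, 2
    {'rank': 'Associate Prof', 'years>6': True},   # Jim, 7
    {'rank': 'Assistant Prof', 'years>6': False},  # Dave, 6
    {'rank': 'Associate Prof', 'years>6': False},  # Anne, 3
]
labels = ['no', 'yes', 'yes', 'yes', 'no', 'no']

tree = build_tree(rows, labels, ['rank', 'years>6'])
print(predict(tree, {'rank': 'Professor', 'years>6': False}))  # 'yes'
```

The sketch omits the pruning phase; pruning would revisit the grown tree and collapse branches whose splits reflect noise rather than signal.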