Purdue CS 59000 - Data Mining - D2588624

School name: Purdue University

Course: CS 59000 - Topics in Computer Sciences

Pages: 72

This preview shows page 1-2-3-4-5-34-35-36-37-68-69-70-71-72 out of 72 pages.


Slide titles:
CS590D: Data Mining, Prof. Chris Clifton
Course Outline (http://www.cs.purdue.edu/~clifton/cs590d)
Data Mining: Classification Schemes
Knowledge Discovery in Databases: Process
Data Preprocessing
Major Tasks in Data Preprocessing
How to Handle Missing Data?
How to Handle Noisy Data?
Data Transformation
Data Transformation: Normalization
Data Reduction Strategies
Principal Component Analysis
Numerosity Reduction
Regression Analysis and Log-Linear Models
Sampling
Discretization
Entropy-Based Discretization
Segmentation by Natural Partitioning
Association Rules
The Apriori Algorithm: An Example
DIC: Reduce Number of Scans
Partition: Scan Database Only Twice
DHP: Reduce the Number of Candidates
FP-tree
Max-patterns
Frequent Closed Patterns
Multiple-level Association Rules
Quantitative Association Rules
Interestingness Measure: Correlations (Lift)
Anti-Monotonicity in Constraint-Based Mining
Convertible Constraints
What Is Sequential Pattern Mining?
Classification
Classification: Use the Model in Prediction
Bayes’ Theorem
Naïve Bayes Classifier
The k-Nearest Neighbor Algorithm
Decision Tree
Algorithm for Decision Tree Induction
Attribute Selection Measure: Information Gain (ID3/C4.5)
Artificial Neural Networks: A Neuron
Artificial Neural Networks: Training
SVM – Support Vector Machines
General SVM
Mapping
Example of polynomial kernel
Regression Analysis and Log-Linear Models in Prediction
Bagging and Boosting
Clustering
Similarity and Dissimilarity Between Objects
Binary Variables
The K-Means Clustering Method
The K-Medoids Clustering Method
Hierarchical Clustering
BIRCH (1996)
Density-Based Clustering Methods
CLIQUE: The Major Steps
COBWEB Clustering Method
Self-organizing feature maps (SOMs)
Data Generalization and Summarization-based Characterization
Characterization: Data Cube Approach
A Sample Data Cube
Iceberg Cube
Top-k Average
What is Concept Description?
Attribute-Oriented Induction: Basic Algorithm
Class Characterization: An Example
Example: Analytical Characterization (cont’d)
Example: Analytical characterization (2)
Measuring the Central Tendency
Measuring the Dispersion of Data
Test Taking Hints

CS590D: Data Mining
Prof. Chris Clifton
March 3, 2005
Midterm Review
Midterm Thursday, March 10, 19:00-20:30, CS G066. Open book/notes.

Course Outline
http://www.cs.purdue.edu/~clifton/cs590d
1. Introduction: What is data mining?
  – What makes it a new and unique discipline?
  – Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining
  – Data mining tasks - Clustering, Classification, Rule learning, etc.
2. Data mining process
  – Task identification
  – Data preparation/cleansing
  – Introduction to WEKA
3. Association Rule mining
  – Problem Description
  – Algorithms
4. Classification / Prediction
  – Bayesian
  – Tree-based approaches
  – Regression
  – Neural Networks
5. Clustering
  – Distance-based approaches
  – Density-based approaches
  – Neural-Networks, etc.
6. Concept Description
  – Attribute-Oriented Induction
  – Data Cubes
7. More on process - CRISP-DM
Midterm
Part II: Current Research
9. Sequence Mining
10. Time Series
11. Text Mining
12. Multi-Relational Data Mining
13. Suggested topics, project presentations, etc.
Text: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.

CS590D Review

Data Mining: Classification Schemes
• General functionality
  – Descriptive data mining
  – Predictive data mining
• Different views, different classifications
  – Kinds of data to be mined
  – Kinds of knowledge to be discovered
  – Kinds of techniques utilized
  – Kinds of applications adapted

Knowledge Discovery in Databases: Process
[Figure: KDD process diagram with the labels Data, Selection, Target Data, Preprocessing, Preprocessed Data, Data Mining, Patterns, Interpretation/Evaluation, Knowledge]
adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Data Preprocessing
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“”
  – noisy: containing errors or outliers
    • e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records

Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
  – a global constant: e.g., “unknown”, a new class?!
  – the attribute mean
  – the attribute mean for all samples belonging to the same class: smarter
  – the most probable value: inference-based, such as a Bayesian formula or decision tree

How to Handle Noisy Data?
• Binning method:
  – first sort data and partition into (equi-depth) bins
  – then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Clustering
  – detect and remove outliers
• Combined computer and human inspection
  – detect suspicious values and check by human (e.g., deal with possible outliers)
• Regression
  – smooth by fitting the data to regression functions

Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – New attributes constructed from the given ones

Data Transformation: Normalization
• min-max normalization
  $v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
• z-score normalization
  $v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
• normalization by decimal scaling
  $v' = \frac{v}{10^{j}}$
  Where
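The three normalizations on the last slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's own code: the function names are hypothetical, the z-score uses the population standard deviation, and because the preview text is cut off before defining j, the sketch assumes the common textbook choice that j is the smallest integer for which max(|v'|) < 1.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_A, max_A] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-score normalization is undefined for zero deviation")
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by 10**j.

    Assumption: j is the smallest non-negative integer with max(|v'|) < 1.
    Returns the scaled values together with the j that was used.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


if __name__ == "__main__":
    incomes = [12000, 73600, 98000]  # illustrative values only
    print(min_max(incomes))
    print(z_score(incomes))
    print(decimal_scaling([-986, 917]))
```

Min-max preserves the relative spacing of the original values but depends on the observed extremes, z-score is the usual choice when the extremes are unknown or dominated by outliers, and decimal scaling only moves the decimal point.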

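The binning method listed under "How to Handle Noisy Data?" (sort, partition into equi-depth bins, then smooth) can be sketched as below. The function names and the sample prices are illustrative assumptions, and the rule that ties in boundary smoothing go to the lower boundary is a choice made for this sketch rather than something the slides specify.

```python
def equi_depth_bins(values, depth):
    """Sort the data and cut it into consecutive bins of `depth` values each.

    The last bin may be shorter when len(values) is not a multiple of depth.
    """
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum.

    Assumption: ties are resolved toward the lower boundary.
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


if __name__ == "__main__":
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # illustrative data
    bins = equi_depth_bins(prices, depth=4)
    print(bins)
    print(smooth_by_means(bins))
    print(smooth_by_boundaries(bins))
```

Smoothing by means removes more local variation, while smoothing by boundaries keeps every output value equal to a value that occurred in the data.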
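Two of the automatic fill-in strategies from "How to Handle Missing Data?" (the attribute mean, and the attribute mean within the same class) can be sketched as follows. This is a hedged illustration: missing values are assumed to be represented by None, and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def fill_with_attribute_mean(values):
    """Replace each missing value (None) by the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = fmean(known)
    return [mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace each missing value by the mean of known values in the same class.

    Raises KeyError if a class has no known value to average.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    means = {c: fmean(vs) for c, vs in by_class.items()}
    return [means[c] if v is None else v for v, c in zip(values, labels)]


if __name__ == "__main__":
    ages = [30, None, 50, 70, None, 90]       # illustrative data
    classes = ["a", "a", "a", "b", "b", "b"]  # illustrative class labels
    print(fill_with_attribute_mean(ages))
    print(fill_with_class_mean(ages, classes))
```

The class-conditional version is the one the slide calls "smarter": it uses the class label to pick a fill value closer to what similar samples look like.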