CS490D: Introduction to Data Mining
Prof. Chris Clifton
March 8, 2004
Midterm Review
Midterm Wednesday, March 10, in class. Open book/notes.

CS490D Midterm Review

Seminar Thursday: Support Vector Machines
• Massive Data Mining via Support Vector Machines
• Hwanjo Yu, University of Illinois
  – Thursday, March 11, 2004
  – 10:30-11:30
  – CS 111
• Support Vector Machines for:
  – classifying from large datasets
  – single-class classification
  – discriminant feature combination discovery

Course Outline
www.cs.purdue.edu/~clifton/cs490d
1. Introduction: What is data mining?
  – What makes it a new and unique discipline?
  – Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining
2. Data mining tasks - Clustering, Classification, Rule learning, etc.
3. Data mining process: Data preparation/cleansing, task identification
  – Introduction to WEKA
4. Association Rule mining
5. Association rules - different algorithm types
6. Classification/Prediction
7. Classification - tree-based approaches
8. Classification - Neural Networks
Midterm
9. Clustering basics
10. Clustering - statistical approaches
11. Clustering - Neural-net and other approaches
12. More on process - CRISP-DM
  – Preparation for final project
13. Text Mining
14. Multi-Relational Data Mining
15. Future trends
Final
Text: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.

Data Mining: Classification Schemes
• General functionality
  – Descriptive data mining
  – Predictive data mining
• Different views, different classifications
  – Kinds of data to be mined
  – Kinds of knowledge to be discovered
  – Kinds of techniques utilized
  – Kinds of applications adapted

Knowledge Discovery in Databases: Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

What Can Data Mining Do?
• Cluster
• Classify
  – Categorical, Regression
• Summarize
  – Summary statistics, Summary rules
• Link Analysis / Model Dependencies
  – Association rules
• Sequence analysis
  – Time-series analysis, Sequential associations
• Detect Deviations

What is Data Warehouse?
• Defined in many different ways, but not rigorously.
  – A decision support database that is maintained separately from the organization’s operational database
  – Support information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
  – The process of constructing and using data warehouses

Example of Star Schema
• Sales Fact Table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  – time: time_key, day, day_of_the_week, month, quarter, year
  – item: item_key, item_name, brand, type, supplier_type
  – branch: branch_key, branch_name, branch_type
  – location: location_key, street, city, state_or_province, country

From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  – Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  – Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
• 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
• 4-D (base) cuboid: time, item, location, supplier

A Sample Data Cube
[Figure: cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TVs in U.S.A.]

Warehouse Summary
• Data warehouse
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• OLAP servers: ROLAP, MOLAP, HOLAP
• Efficient computation of data cubes
  – Partial vs. full vs. no materialization
  – Multiway array aggregation
  – Bitmap index and join index implementations
• Further development of data cube technology
  – Discovery-driven and multi-feature cubes
  – From OLAP to OLAM (on-line analytical mining)

Data Preprocessing
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“”
  – noisy: containing errors or outliers
    • e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records

Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
  – Accuracy
  – Completeness
  – Consistency
  – Timeliness
  – Believability
  – Value added
  – Interpretability
  – Accessibility
• Broad categories:
  – intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

How to Handle Missing Data?
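
As a rough illustration of the three kinds of dirty data listed under Data Preprocessing (incomplete, noisy, inconsistent), the Python sketch below flags each kind in a single toy record built from the slide's own examples. The function name `find_quality_issues`, the field names, the reference date (the lecture date), and the reading of “03/07/1997” as March 7, 1997 are all assumptions made for illustration; none of this is taken from the course material.

```python
from datetime import date


def find_quality_issues(record, today=date(2004, 3, 8)):
    """Return (kind, message) pairs describing problems in one record (a dict).

    Hypothetical helper: field names and the default reference date are
    assumptions chosen to mirror the slide's examples.
    """
    issues = []

    # Incomplete: an attribute value is absent or empty, e.g. occupation="".
    for field in ("occupation", "salary", "age", "birthday"):
        if record.get(field) in (None, ""):
            issues.append(("incomplete", f"{field} is missing"))

    # Noisy: a value that cannot be right, e.g. Salary=-10.
    salary = record.get("salary")
    if isinstance(salary, (int, float)) and salary < 0:
        issues.append(("noisy", f"salary {salary} is negative"))

    # Inconsistent: the stated age disagrees with the age implied by birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, date):
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            issues.append(
                ("inconsistent", f"age {age} but birthday implies {implied_age}")
            )

    return issues


# Toy record combining the slide's three examples (assumed date format: MM/DD/YYYY).
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
issues = find_quality_issues(record)
for kind, message in issues:
    print(kind, "-", message)
```

A check like this only detects problems; deciding whether to fill in, correct, or drop the affected values is the data cleaning task described under Major Tasks in Data Preprocessing.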


Purdue CS 490D - Midterm Review
