CS 267: Applications of Parallel Computers Lecture 25: Data MiningLecture ScheduleN-Body AssignmentSlide 4OutlineData Mining OverviewWhat is Data Mining?Knowledge Discovery ProcessWhy Mine Data?Data Mining TasksClassificationClassification Example: Sky SurveyClusteringClustering ApplicationsAssociate Rule DiscoveryOther Data Mining ProblemsSerial Algorithms for ClassificationDecision Tree AlgorithmsTree InductionGINI Splitting CriterionSplitting Based on GINISplitting Based on INFOC4.5 ClassificationSLIQ ClassificationSLIQ ExampleSPRINTParallel Algorithms for ClassificationParallel Tree Construction: Approach 1Parallel Tree Construction: Approach 2Parallel Tree Construction: Hybrid ApproachContinuous DataPerformance Results from ScalParCSlide 3301/14/19 CS267, Yelick 1CS 267: Applications of Parallel ComputersLecture 25:Data MiningKathy YelickMaterial based on lecture by Vipin Kumar and Mahesh Joshihttp://www-users.cs.umn.edu/~mjoshi/hpdmtut/01/14/19 CS267, Yelick 2Lecture Schedule•12/3: 3 things Projects and performance analysis (N-body assignment observations) Data Mining HKN Review at 3:40•12/5: The Future of Parallel Computing David Bailey•12/13: CS267 Poster Session (2-4pm, Woz)•12/14: Final Papers due01/14/19 CS267, Yelick 3N-Body Assignment•Some observations on your N-Body assignments•Problems and pitfalls to avoid in final project•Performance analysis•Micro-benchmarks are good•To understand application performance, build up performance model from measured pieces, e.g., network performance•Noise is expected, but quantifying it is also useful•Means, alone, can be confusing•Median + variance is good•Carefully select problem sizes•Are they large enough to justify the # of processors?•What do real users want?•Can you vary the problem size in some reasonable way?01/14/19 CS267, Yelick 4N-Body Assignment•Minor comments on N-Body Results•Describe performance graphs – what is expected, surprising•Sanity check your numbers•Are you getting more than P time speedup on P processors?•Does the observed running time (“time command”) match total?•What is your Mflops rate? Is it between 10 and 90% of HW peak?•Be careful of different timers•Get-time-of-day is wall-clock time (charged for OS and others)•Clock is process time (Linux creates a process per thread)•RT clock on Cray is wall clock time•Check captions, titles, axes of figures/graphs •Run spell checker01/14/19 CS267, Yelick 5Outline•Overview of Data Mining•Serial Algorithms for Classification•Parallel Algorithms for Classification•Summary01/14/19 CS267, Yelick 6Data Mining Overview•What is Data Mining?•Data Mining Tasks•Classification•Clustering•Association Rules and Sequential Patterns•Regression•Deviation Detection01/14/19 CS267, Yelick 7What is Data Mining?•Several definitions:•Search for valuable information in large volumes of data•Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules•A step in the Knowledge Discovery in Databases (KDD) process01/14/19 CS267, Yelick 8Knowledge Discovery Process•Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in dataClean, collect, summarizeDatapreparationDataminingVerificationandevaluationDatawarehouseOperational DatabasesTrainingDataModel,patterns01/14/19 CS267, Yelick 9Why Mine Data?•Data collected and stored at enormous rate•Remote sensor on a satellite•Telescope scanning the skies•Microarrays generating gene expressions•Scientific simulations•Traditional techniques infeasible•Data mining for data reduction•Cataloging, classifying, segmenting•Help scientists formulate hypotheses01/14/19 CS267, Yelick 10Data Mining Tasks•Predictive Methods: Use variable to predict unknown or future values of other variables•Classification•Regression•Deviation Detection•Descriptive Methods: Find human-interpretable patterns that describe data•Clustering•Association Rule Discovery•Sequential Pattern Discovery01/14/19 CS267, Yelick 11Classification•Given a collection of records (training set) •Each record contains a set of attributes, on of which is the class•Find a model for class attributes as a function of the values of other attributes•Goal: previously unseen records should be accurately assigned a class•A test set is used to determine accuracy•Examples:•Direct marketing: targeted mailings based on buy/don’t class•Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc.•Sky survey cataloging: catalog objects based as star/galaxy01/14/19 CS267, Yelick 12Classification Example: Sky Survey•Currently 3K images•23Kx23K pixelsApproach•Segment the image•Measure image attributes – 40 per object•Model the class (star/galaxy or stage) based on the attributesImages from: http://aps.umn.edu01/14/19 CS267, Yelick 13Clustering•Given a set of data points:•Each has a set of attributes•A similarity measure among them•Find clusters such that:•Points in one cluster are more similar to each other than points in other clusters•Similarities measures are problem specific:•E.g., Euclidean distance for continuous data01/14/19 CS267, Yelick 14Clustering Applications•Market Segmentation:•Divide market into distinct subsets•Document clustering: •Find group of related documents, based on common keywords•Set in information retrieval•Financial market analysis•Find groups of companies with common stock behavior01/14/19 CS267, Yelick 15Associate Rule Discovery•Given a set of records, each containing set of items•Produce dependency rules that predict occurrences of an item based on others•Applications:•Marketing, sales promotion and shelf management•Inventory managementTID Items1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, MilkRules: {Milk} {Coke} {Diaper,Milk} Beer01/14/19 CS267, Yelick 16Other Data Mining Problems•Sequential Pattern Discovery •Given a set of objects, each with a timeline of events•Find rules that predict sequential dependencies•Example: patterns in telecommunications alarm logs•Regression: •Predict a value one variable given others•Assume a linear or non-linear model of dependence•Examples:•Predict sales amounts based on advertising expenditures•Predict wind velocities
View Full Document