Berkeley COMPSCI C267 - Data Mining - D137124

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI C267> Data Mining

DOC PREVIEW

Berkeley COMPSCI C267 - Data Mining

School name University of California, Berkeley

Course Compsci C267- Applications of Parallel Computers

Pages 33

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS 267: Applications of Parallel Computers Lecture 25: Data MiningLecture ScheduleN-Body AssignmentSlide 4OutlineData Mining OverviewWhat is Data Mining?Knowledge Discovery ProcessWhy Mine Data?Data Mining TasksClassificationClassification Example: Sky SurveyClusteringClustering ApplicationsAssociate Rule DiscoveryOther Data Mining ProblemsSerial Algorithms for ClassificationDecision Tree AlgorithmsTree InductionGINI Splitting CriterionSplitting Based on GINISplitting Based on INFOC4.5 ClassificationSLIQ ClassificationSLIQ ExampleSPRINTParallel Algorithms for ClassificationParallel Tree Construction: Approach 1Parallel Tree Construction: Approach 2Parallel Tree Construction: Hybrid ApproachContinuous DataPerformance Results from ScalParCSlide 3301/14/19 CS267, Yelick 1CS 267: Applications of Parallel ComputersLecture 25:Data MiningKathy YelickMaterial based on lecture by Vipin Kumar and Mahesh Joshihttp://www-users.cs.umn.edu/~mjoshi/hpdmtut/01/14/19 CS267, Yelick 2Lecture Schedule•12/3: 3 things Projects and performance analysis (N-body assignment observations) Data Mining HKN Review at 3:40•12/5: The Future of Parallel Computing David Bailey•12/13: CS267 Poster Session (2-4pm, Woz)•12/14: Final Papers due01/14/19 CS267, Yelick 3N-Body Assignment•Some observations on your N-Body assignments•Problems and pitfalls to avoid in final project•Performance analysis•Micro-benchmarks are good•To understand application performance, build up performance model from measured pieces, e.g., network performance•Noise is expected, but quantifying it is also useful•Means, alone, can be confusing•Median + variance is good•Carefully select problem sizes•Are they large enough to justify the # of processors?•What do real users want?•Can you vary the problem size in some reasonable way?01/14/19 CS267, Yelick 4N-Body Assignment•Minor comments on N-Body Results•Describe performance graphs – what is expected, surprising•Sanity check your numbers•Are you getting more than P time speedup on P processors?•Does the observed running time (“time command”) match total?•What is your Mflops rate? Is it between 10 and 90% of HW peak?•Be careful of different timers•Get-time-of-day is wall-clock time (charged for OS and others)•Clock is process time (Linux creates a process per thread)•RT clock on Cray is wall clock time•Check captions, titles, axes of figures/graphs •Run spell checker01/14/19 CS267, Yelick 5Outline•Overview of Data Mining•Serial Algorithms for Classification•Parallel Algorithms for Classification•Summary01/14/19 CS267, Yelick 6Data Mining Overview•What is Data Mining?•Data Mining Tasks•Classification•Clustering•Association Rules and Sequential Patterns•Regression•Deviation Detection01/14/19 CS267, Yelick 7What is Data Mining?•Several definitions:•Search for valuable information in large volumes of data•Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules•A step in the Knowledge Discovery in Databases (KDD) process01/14/19 CS267, Yelick 8Knowledge Discovery Process•Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in dataClean, collect, summarizeDatapreparationDataminingVerificationandevaluationDatawarehouseOperational DatabasesTrainingDataModel,patterns01/14/19 CS267, Yelick 9Why Mine Data?•Data collected and stored at enormous rate•Remote sensor on a satellite•Telescope scanning the skies•Microarrays generating gene expressions•Scientific simulations•Traditional techniques infeasible•Data mining for data reduction•Cataloging, classifying, segmenting•Help scientists formulate hypotheses01/14/19 CS267, Yelick 10Data Mining Tasks•Predictive Methods: Use variable to predict unknown or future values of other variables•Classification•Regression•Deviation Detection•Descriptive Methods: Find human-interpretable patterns that describe data•Clustering•Association Rule Discovery•Sequential Pattern Discovery01/14/19 CS267, Yelick 11Classification•Given a collection of records (training set) •Each record contains a set of attributes, on of which is the class•Find a model for class attributes as a function of the values of other attributes•Goal: previously unseen records should be accurately assigned a class•A test set is used to determine accuracy•Examples:•Direct marketing: targeted mailings based on buy/don’t class•Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc.•Sky survey cataloging: catalog objects based as star/galaxy01/14/19 CS267, Yelick 12Classification Example: Sky Survey•Currently 3K images•23Kx23K pixelsApproach•Segment the image•Measure image attributes – 40 per object•Model the class (star/galaxy or stage) based on the attributesImages from: http://aps.umn.edu01/14/19 CS267, Yelick 13Clustering•Given a set of data points:•Each has a set of attributes•A similarity measure among them•Find clusters such that:•Points in one cluster are more similar to each other than points in other clusters•Similarities measures are problem specific:•E.g., Euclidean distance for continuous data01/14/19 CS267, Yelick 14Clustering Applications•Market Segmentation:•Divide market into distinct subsets•Document clustering: •Find group of related documents, based on common keywords•Set in information retrieval•Financial market analysis•Find groups of companies with common stock behavior01/14/19 CS267, Yelick 15Associate Rule Discovery•Given a set of records, each containing set of items•Produce dependency rules that predict occurrences of an item based on others•Applications:•Marketing, sales promotion and shelf management•Inventory managementTID Items1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, MilkRules: {Milk}  {Coke} {Diaper,Milk}  Beer01/14/19 CS267, Yelick 16Other Data Mining Problems•Sequential Pattern Discovery •Given a set of objects, each with a timeline of events•Find rules that predict sequential dependencies•Example: patterns in telecommunications alarm logs•Regression: •Predict a value one variable given others•Assume a linear or non-linear model of dependence•Examples:•Predict sales amounts based on advertising expenditures•Predict wind velocities

View Full Document

Berkeley COMPSCI C267 - Data Mining

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Berkeley COMPSCI C267 - Data Mining

Sign up for free to view:

Please select your school