Data Mining Overview

• Slide 2
• Data Mining is …
• Data Mining is … (2)
• Data Mining - Alternative Names?
• What is Data Mining? Real Example from the NBA
• Data Mining Defining Characteristics
• Data Mining, circa 1963
• Since 1963
• Slide 10
• Why Data Mining?
• Multidisciplinary
• What Is Data Mining?
• Confusing Terminology
• Required Expertise
• Nuggets
• Data Mining: History of the Field
• Knowledge Discovery in Databases: Process
• Steps of a KDD Process
• Data Mining and Business Intelligence
• Multi-Dimensional View of Data Mining
• Why Mining in Data Warehouses?
• Ingredients of an Effective KDD Process
• Potential Applications
• Market Analysis and Management
• Corporate Analysis & Risk Management
• Fraud Detection & Mining Unusual Patterns
• Other Applications
• Example: Retailing
• Example: Aviation Safety
• Data Mining & Individual Predictions
• More cartoons
• Data Mining: Classification Schemes
• What Can Data Mining Do?
• Frequent Patterns & Association Rules
• Sequential Patterns/Associations
• More Pattern/Association Uses
• Clustering
• Deviation Detection
• More Uses for Clusters/Outliers
• Classification
• Slide 43
• More Classification Uses
• War Stories: Warehouse Product Allocation
• War Stories: Inventory Forecasting
• Necessity for Data Mining
• Data Mining Complications
• Major Issues in Data Mining
• Are All the “Discovered” Patterns Interesting?
• Can We Find All and Only Interesting Patterns?
• Related Techniques: OLAP On-Line Analytical Processing
• Related Techniques: Visualization
• Data Mining and Visualization
• Research Issues in Data Mining
• Effectiveness
• Efficiency
• Applications
• Theory: Foundation for Data Mining
• Acknowledgements/Sources

Data Mining is …
• “advanced methods for exploring and modeling relationships in large amounts of data.” (SAS)
• “the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” (Gartner Group)
• the “extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data.” (Clifton)

Data Mining is … (2)
• “the exploration and analysis, by automatic and semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules” (Michael Berry and Gordon Linoff)
• “the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth)

Data Mining - Alternative Names?
• Data Mining
• Knowledge Mining
• Knowledge Discovery in Databases
• Data Archaeology
• Data Dredging
• Database Mining
• Knowledge Extraction
• Data Pattern Processing
• Information Harvesting
• Siftware

What is Data Mining? Real Example from the NBA
• Play-by-play information recorded by teams
  • Who is on the court
  • Who shoots
  • Results
• Coaches want to know what works best
  • Plays that work well against a given team
  • Good/bad player matchups
• Advanced Scout (from IBM Research) is a data mining tool to answer these questions
[Figure label: Starks+Houston+Ward playing]

Data Mining Defining Characteristics
1. The Data
  • Massive, operational, and opportunistic
2. The Users and Sponsors
  • Business decision support
3. The Methodology
  • Computer-intensive “ad hockery”
  • Multidisciplinary lineage

Data Mining, circa 1963
• IBM 7090
• 600 cases
• “Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”

Since 1963
• Moore’s Law: The information density on silicon-integrated circuits doubles every 18 to 24 months.
• Cost of storage
• Cost of processing power
• Parallel computing
• Advances in DBMS and Data Warehousing
• Advances in AI
• Advances in computing algorithms
• Advances in statistics

Data Deluge
• electronic point-of-sale data
• hospital patient registries
• catalog orders
• bank transactions
• remote sensing images
• tax returns
• airline reservations
• credit card charges
• stock trades
• OLTP
• telephone calls

Why Data Mining?
• Evolution of database technology
  • To collect a large amount of data: primitive file processing
  • To store and query data efficiently: DBMS
• New challenges: huge amount of data, how to analyze and understand?
  • Data mining

Multidisciplinary
• Databases
• Statistics
• Pattern Recognition
• KDD
• Machine Learning
• AI
• Neurocomputing
• Data Mining

What Is Data Mining?
• IT: Complicated database queries
• ML: Inductive learning from examples
• Stat: What we were taught not to do

Confusing Terminology
“Bias”
• Statistics: the expected difference between an estimator and what is being estimated
• Neurocomputing: the constant term in a linear combination
• Machine Learning: a reason for favoring any model that does not fit the data perfectly

Required Expertise
• The domain expert understands the particulars of the business or scientific problem; the relevant background knowledge, context, and terminology; and the strengths and deficiencies of the current solution (if a current solution exists).
• The data expert understands the structure, size, and format of the data.
• The analytical expert understands the capabilities and limitations of the methods that may be relevant to the problem.

Nuggets
“If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun. You really need people who understand what it is they are looking for – and what they can do with it once they find it.” (Herb Edelstein)

Data Mining: History of the Field
• The term “data mining” has been around since at least 1983, as a pejorative term in the statistics community
• Knowledge Discovery in Databases workshops started in 1989
  • Now a conference under the auspices of ACM SIGKDD
• IEEE conference series started 2001

Knowledge Discovery in Databases: Process
[Figure credit: Jian Pei; adapted from U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview”]

Steps of a KDD Process
• Learning the application domain
  • relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
  • Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
  • Summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data
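
The KDD steps listed above can be sketched as a small pipeline. This is only an illustrative sketch, not the method from the slides: the toy records, the field names (`store`, `item`, `amount`, `clerk`), and the helper functions `select`, `clean`, `transform`, and `summarize` are all assumptions, and summarization is picked arbitrarily as the mining function.

```python
from statistics import mean

# Hypothetical toy records; field names are illustrative, not from the slides.
RAW = [
    {"store": "A", "item": "milk", "amount": 3.0, "clerk": "x"},
    {"store": "A", "item": "milk", "amount": None, "clerk": "y"},
    {"store": "B", "item": "bread", "amount": 2.0, "clerk": "x"},
    {"store": "B", "item": "milk", "amount": 4.0, "clerk": "z"},
    {"store": "A", "item": "bread", "amount": 1.0, "clerk": "y"},
]


def select(records, fields):
    """Create the target data set: keep only the fields relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records, field):
    """Data transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]


def summarize(records, key, field):
    """Mining function (summarization): mean of `field` per value of `key`."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r[field])
    return {k: mean(v) for k, v in groups.items()}


def kdd_pipeline(records):
    """Chain the steps: selection, cleaning, transformation, mining."""
    target = select(records, ["store", "amount"])
    cleaned = clean(target)
    scaled = transform(cleaned, "amount")
    return summarize(scaled, "store", "amount")


result = kdd_pipeline(RAW)
print(result)
```

Each function maps to one bullet of the process, so a different mining function (classification, clustering, and so on) could be swapped in for `summarize` without touching the earlier steps. The first step, learning the application domain, has no code counterpart; it is what decides which fields `select` keeps.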