SIGKDD Explorations. Copyright © 2000 ACM SIGKDD, January 2000. Volume 1, Issue 2 – page 97

Interface ’99: A Data Mining Overview

Arnold Goodman
The UCI Center for Statistical Consulting
University of California, Irvine
Irvine, CA 92697-5105
[email protected]

ABSTRACT
This personal overview of Interface ’99 is intended to communicate its meaning and relevance to SIGKDD, as well as provide valuable information on trends within the Interface for data miners seeking to learn more about statistics. In addition, it is the newest link in a bridge between the Interface and KDD begun by References 2-4 and the sessions on KDD at Interface ’98 and Interface ’99.

Keywords
Review of Interface ’99 conference, statistics.

1. INTRODUCTION
KDD uses computer technology and statistics to discover knowledge buried deep within enormous databases that are usually heterogeneous and not from designed experiments. Although this situation is outside the bounds of statistical traditions, it is not outside the bounds of statistical techniques or thinking. KDD is actually part of the Interface of Computing and Statistics, and the Annual Symposia on the Interface are quite relevant to SIGKDD. In fact, the Keynote Addresses at Interfaces ’97, ’98 and ’99 all dealt with KDD: Jerry Friedman said, “Statistics is no longer the only data game in town”; David Rocke said, “The algorithm is the estimator”; and Leo Breiman said, “Accuracy vs. simplicity and high vs. low dimensionality should be analyzed for competing models of the data.”

Reference 1 contains the Proceedings of Interface ’98, while References 2-4 provide summaries of it and a short history of the Interface. Reference 5 contains the Proceedings of Interface ’99, and this article provides a brief interpretive overview of it.
Interface ’00 is in New Orleans on April 5-8, 2000 and features Modeling of the Earth’s Systems from the Physical to the Infrastructural, with Sessions on Archeology, Agriculture, Atmospherics, Business, Environment, Health and Information Technology. It is Co-Chaired by Sallie Keller-McNulty, Vicki Lancaster and Sally Morton (http://neptuneandco.com/interface/).

Interface ’01 will be in Orange County, California on June 14-16, 2001 and will feature Challenges for a New Century, with Sessions on Mega Datasets, Computational Finance, Mining of Web Data, Statistical Models for Text, Combining Models, Image Analysis, Graphics, Visualization, Beyond Correlation, Bayesian Analysis and Decisions, Evaluation of Models, Software Engineering, Trees, Clustering and Statistical Computing Pioneers, plus a Mini Conference on Bioinformatics at the Interface. It is Co-Chaired by Padhraic Smyth and the author (http://www.ics.uci.edu/interface/).

2. SUMMARY
Interface ’99 was the 31st Interface Symposium and featured Models, Predictions and Computing. It was Co-Chaired by Kenneth Berk and Mohsen Pourahmadi. Session topics of most interest to SIGKDD were: Testing & Validation of Software, Classification & Regression Trees, Clustering & Classification, Bootstrap, Visualization, Markov Chain Monte Carlo, Image Analysis, Mixed-Effects Models, Data Mining and Computational Biology. In addition, John Elder and the author organized two KDD Sessions: The Best of KDD-98, with John Elder’s “An Overview of KDD-98,” Greg Ridgeway’s “The State of Boosting” and Pedro Domingos’ (Prize Paper) “Occam’s Two Razors;” and Data Mining on the Interface, with Usama Fayyad’s (Tutorial) “Scaling to Large Databases and the Link between Statistics and Databases.”

3. PERSONAL PERSPECTIVE
It is interesting that the Sessions on Testing & Validation of Software and on Mixed-Effects Models echoed the author’s solutions to problems in the early 1960’s, and that the Sessions on Visualization echoed his major professor’s solution to a problem of his in the early 1970’s. Two recent talks by recognized experts in Bayesian analysis have also provided the author with new insights into both the past and future of this flowering area.

Evaluating statistical software has a long history, beginning with the 1962 checklist and scorecard that was developed by Robert McCornack and the author (Reference 6). This was the first scorecard for evaluating statistical software and perhaps the first scorecard for evaluating software in general.

The actual estimation of mixed-effects models with both fixed and random components first estimates the fixed component assuming simplicity and a knowledge of the random component, then estimates the random component assuming knowledge gained about the fixed component, and repeats this sequence iteratively until the situation sufficiently stabilizes. This is the application of a procedure very similar to one developed by the author in 1964 for a somewhat easier least squares problem (Reference 7).

Herman Chernoff introduced a clever graphical method to represent vectors of data by facial expressions, called Chernoff’s Faces in the literature, around ten years before computer graphics and twenty years before computer visualization (Reference 8). This technique plus sequential analysis, another specialty of his, should prove quite useful to the explorations within KDD.

The perspective of Arnold Zellner, who took Bayes into novel areas of econometrics long before it became fashionable, contrasted with that of Eric Horvitz, who is now taking Bayes into novel areas of computer science.
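The alternating scheme described above can be illustrated with a toy sketch. This is not the author’s 1964 procedure, only a minimal stand-in under assumed simulated data: a model with fixed slopes and random group intercepts is fit by alternating between estimating the fixed component (with the current random-intercept estimates removed) and re-estimating the group intercepts from the residuals, until the estimates stabilize. All names and numbers here are illustrative assumptions; a real mixed-model fit would also shrink the random effects (e.g. via REML).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ beta (fixed component) + group intercepts u (random) + noise
n_groups, per_group = 20, 30
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, 2))
beta_true = np.array([1.5, -2.0])
u_true = rng.normal(scale=0.8, size=n_groups)
y = X @ beta_true + u_true[groups] + rng.normal(scale=0.5, size=len(groups))

# Alternate: (1) estimate the fixed component with current random effects
# removed, (2) re-estimate the random intercepts from the residuals,
# and repeat until the sequence sufficiently stabilizes.
u_hat = np.zeros(n_groups)
for _ in range(50):
    beta_hat, *_ = np.linalg.lstsq(X, y - u_hat[groups], rcond=None)
    resid = y - X @ beta_hat
    u_new = np.array([resid[groups == g].mean() for g in range(n_groups)])
    if np.max(np.abs(u_new - u_hat)) < 1e-8:
        u_hat = u_new
        break
    u_hat = u_new

print(beta_hat)  # recovers something close to beta_true
```

Because the covariates are nearly orthogonal to the group indicators, this block-alternating least squares stabilizes in a handful of iterations, which is the essence of the iterate-until-stable idea in the text.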
They inspired the following conjecture: although the past of Bayesian analysis was enabled by methodology while being static and non-interactive, the future of Bayesian analysis might well be enabled by technology while being dynamic and interactive.

4. THE CONFERENCE CONTENT

4.1 Testing & Validation of Software
“AETGWeb - A Combinatorial Approach to Designing Test Cases for Testing Software” by Sid R. Dalal, A. Jam, G. Patton and M. Rathi at Telcordia Technologies (formerly Bellcore)

Combinatorial design is a relatively new approach to test generation, and AETGWeb uses a new set of combinatorial design algorithms to reduce the number of tests. In unit and system testing, it compares favorably to the standard experimental-design approaches.

“Statistical Reference Datasets (StRD) for Assessing the Numerical Accuracy of Statistical Software” by W.F. Guthrie,

