NCSU ST 610 - A Review of Software Packages for Data Mining

© 2003 American Statistical Association. DOI: 10.1198/0003130032486
The American Statistician, November 2003, Vol. 57, No. 4, p. 290

A Review of Software Packages for Data Mining

Dominique HAUGHTON, Joel DEICHMANN, Abdolreza ESHGHI, Selin SAYEK, Nicholas TEEBAGY, and Heikki TOPI

We present to the statistical community an overview of five data mining packages with the intent of leaving the reader with a sense of the different capabilities, the ease or difficulty of use, and the user interface of each package. We are not attempting to perform a controlled comparison of the algorithms in each package to decide which has the strongest predictive power, but instead hope to give an idea of the approach to predictive modeling used in each of them. The packages are compared in the areas of descriptive statistics and graphics, predictive models, and association (market basket) analysis.

As expected, the packages affiliated with the most popular statistical software packages (SAS and SPSS) provide the broadest range of features with remarkably similar modeling and interface approaches, whereas the other packages all have their special sets of features and specific target audiences whom we believe each of the packages will serve well. It is essential that an organization considering the purchase of a data mining package carefully evaluate the available options and choose the one that provides the best fit with its particular needs.

KEY WORDS: Clementine; Ghostminer; Quadstone; SAS Enterprise Miner; XLMiner.

1. INTRODUCTION

The term “data mining” has come to refer to a set of techniques that originated in statistics, computer science, and related areas that are typically used in the context of large datasets. The purpose of data mining is to reveal previously hidden associations between variables that are potentially relevant for managerial decision making.
The exploratory and modeling techniques used in data mining are familiar to many statisticians and include exploratory tools such as histograms, scatterplots, boxplots, and analytic tools such as regression, neural nets, and decision trees.

This article’s objective is to present to the statistical community an overview of five data mining packages, and to leave the reader with a sense of the different capabilities, the ease or difficulty of use, and the user interface of each package. We are not attempting to perform a controlled comparison of the algorithms in each package to decide which has the strongest predictive power, but instead aim to give an idea of the approach to predictive modeling used in each of them.

The article is structured as follows: we first outline the methodology we used to evaluate the packages and give a summary of key characteristics of each package. We continue by focusing on descriptive statistics and exploratory graphs. The section that follows is devoted to predictive modeling, covering model building and assessment. A section on association (market basket) analysis is then provided, followed by a conclusion.

Dominique Haughton, Joel Deichmann, Abdolreza Eshghi, Selin Sayek, Nicholas Teebagy, and Heikki Topi are members of the Data Analytics Research Team, Bentley College, 175 Forest Street, Waltham, MA 02452 (E-mail: [email protected]). Selin Sayek is also affiliated with Bilkent University, Turkey. The authors thank each of the vendors of the reviewed packages for their assistance in dealing with installation concerns and for their very prompt replies to all of the authors’ questions. Our sincere thanks also go to Section Editor Joe Hilbe and his editorial staff for their careful reading of the manuscript and their support for the project, as well as for their many useful comments. We also thank Carter Rakovski and other members of the Academic Technology Center at Bentley College for all their help.

2. METHODOLOGY

The list of packages we have selected for this review is by no means exhaustive. We have chosen to cover the data mining packages associated with the two leading statistical packages, SAS and SPSS. We also decided to review two “stand-alone” packages, GhostMiner and Quadstone, and an Excel add-on, XLMiner.

We compare the packages in the areas of descriptive statistics and graphics, predictive models, and association (market basket) analysis. Predictive modeling is one of the main applications of data mining, and exploratory descriptive analyses always precede modeling efforts. Association analysis, in which “baskets” of goods purchased together are identified, is also very commonly used.

For the descriptive and modeling analysis, we used the Direct Marketing Educational Foundation dataset 2, merged with Census geo-demographic variables from dataset 6 (www.the-dma.org/dmef). The dataset contains 19,185 observations and concerns a business with multiple divisions, each mailing different catalogs to a unified customer database. The target variable, BUY10, equals unity if a customer made a purchase from the January 1996 division D catalog, zero if not. Data available (through June 1995) as potential predictors, for the whole business and each division, include: life-to-date orders, dollars, and number of items; orders and dollars in the most recent 6, 12, 24, and 36 months; recency of first and latest purchases; payment method and minimal demographics. Census geo-demographic variables give race, population, age profiles, as well as information on property values at the zip-plus-four level.
The number of candidate predictor variables is nearly 200, representing a realistic situation in database marketing.

For our association analysis, we chose to use the Direct Marketing Educational Foundation’s Bookbinders Club Case dataset including data from 1,580 customers.

A typical hardware environment used in our tests was an 800 MHz IBM A22m with 256 MB RAM (except for Quadstone, which required 512 MB of RAM) and a 30 GB hard drive.

3. SUMMARY OF KEY CHARACTERISTICS

Table 1 presents a brief summary of the main characteristics of the packages reviewed here. Later sections will discuss many of the key features and provide more details.

We were able to obtain pricing information for most of the packages. An academic server license for Enterprise Miner is available for $40,000–100,000, and a mainframe license for $47,000–222,000. Commercial licenses cost $119,000–281,000 (mainframe $140,000–629,000). GhostMiner costs $2,500–30,000

