WMU CS 595 - A Database Perspective on Knowledge Discovery

58 November 1996/Vol. 39, No. 11 COMMUNICATIONS OF THE ACM

A Database Perspective on Knowledge Discovery

Tomasz Imielinski and Heikki Mannila

The concept of data mining as a querying process and the first steps toward efficient development of knowledge discovery applications are discussed.

Database mining is not simply another buzzword for statistical data analysis or inductive learning. Database mining sets new challenges to database technology: new concepts and methods are needed for query languages, basic operations, and query processing strategies. The most important new component is the ad hoc nature of knowledge and data discovery (KDD) queries and the need for efficient query compilation into a multitude of existing and new data analysis methods. Hence, database mining builds upon the existing body of work in statistics and machine learning but provides completely new functionalities.

The current generation of database systems is designed mainly to support business applications. The success of Structured Query Language (SQL) has capitalized on a small number of primitives sufficient to support a vast majority of such applications. Unfortunately, these primitives are not sufficient to capture the emerging family of new applications dealing with knowledge discovery.

Most current KDD systems offer isolated discovery features using tree inducers, neural nets, and rule discovery algorithms. Such systems cannot be embedded into a large application and typically offer just one knowledge discovery feature. The same is true for most online analytical processing (OLAP) tools. We call such systems first-generation database mining systems.
The current situation is very similar to the situation in database management systems in the early 1960s, when each application had to be built from scratch, without the benefit of dedicated database primitives provided later by SQL and relational database application programming interfaces (APIs). In fact, today’s techniques of data mining would more appropriately be described as “file mining” since they assume a loose coupling between a data-mining engine and a database. In most cases, this interface takes the form of two simple commands: “read from” and “write to”, just as Cobol programs interacted with large data files 30 years ago. Additionally, the KDD field currently suffers from a lack of identity: there is no consensus on whether KDD is an area of its own.

Some claim knowledge discovery is simply machine learning with large datasets, and that the database component of KDD is essentially maximizing performance of mining operations running on top of large persistent datasets and involving expensive I/O. It is, of course, important work to close the gap between the inductive learning tools and the database systems currently available. However, although improving performance is an important issue, it is probably not sufficient to trigger a qualitative change in system capabilities. Moreover, many database management system (DBMS) performance enhancements important for database mining are also desirable in general. Such features include parallel query execution, in-memory evaluation as well as optimized support for sampling and aggregate query operations such as “Count,” “Average,” and “Variance.” These features are of general interest and their importance to database mining can be viewed only as a side effect of basic research in DBMS.
Moreover, incremental improvements of existing DBMSs to better suit KDD applications will most likely be insufficient, since the current DBMSs were primarily targeted at a different class of applications.

A historical analogy is perhaps in order; one may argue that performance improvement of I/O operations alone would have never triggered the DBMS research field nearly 30 years ago. Query languages, query optimization, and transaction processing were the driving ideas behind the tremendous growth of the database field in the last three decades. To be more specific, it was the ad hoc nature of querying that created a challenge to build general-purpose query optimizers. If queries were predefined and their number was limited, it would be sufficient to develop highly tuned, stand-alone library routines. We believe database mining can follow a similar path of development.

New Research Frontier

The following are two research scenarios: a short-term vision that is strictly performance driven and well underway, and a long-term one that is only beginning now and possibly defines a new frontier for database research.

Research Program: Short Term

The short-term research program calls for efficient algorithms implementing machine learning tools on top of large databases and utilizing the existing DBMS support.

The implementation of classification algorithms (say, C4.5) or neural networks on top of a large database requires tighter coupling with the database system and intelligent use of indexing techniques [4, 6]. For example, training a classifier on a large training set stored in a database may require multiple passes through the data using different orderings between attributes. This can be implemented by utilizing DBMS support for aggregate operations, indexes, and database sorting (order by). Clustering may require efficient implementations of nearest neighbor algorithms on top of large databases.
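
As a rough illustration of the tighter coupling described above, the sketch below pushes the counting work of one classifier-training step into the database as a GROUP BY aggregate with an ORDER BY, and leaves only the split-quality arithmetic to the client. It uses Python's built-in sqlite3 module as a stand-in DBMS. The table "weather", its columns and rows, and the helper names class_counts, information_gain, and best_attribute are hypothetical, introduced only for this example. Plain information gain is used as a generic split criterion; this is an assumption for illustration, not a reproduction of C4.5 (which uses gain ratio) or of the techniques in [4, 6].

```python
import math
import sqlite3


def build_demo_db():
    """Create a tiny in-memory training table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
    rows = [
        ("sunny", "no", "no"),
        ("sunny", "yes", "no"),
        ("overcast", "no", "yes"),
        ("overcast", "yes", "yes"),
        ("rainy", "no", "yes"),
        ("rainy", "yes", "no"),
    ]
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)
    return conn


def class_counts(conn, table, attribute, label):
    """Return {attribute_value: {class_label: count}} from one aggregate query.

    The DBMS does the grouping, counting, and sorting; the client never
    sees individual rows. Identifiers are interpolated into the SQL text,
    so this sketch assumes they come from trusted code, not user input.
    """
    query = (
        f"SELECT {attribute}, {label}, COUNT(*) FROM {table} "
        f"GROUP BY {attribute}, {label} ORDER BY {attribute}, {label}"
    )
    counts = {}
    for value, cls, n in conn.execute(query):
        counts.setdefault(value, {})[cls] = n
    return counts


def entropy(histogram):
    """Shannon entropy (in bits) of a {class_label: count} histogram."""
    total = sum(histogram.values())
    return -sum((n / total) * math.log2(n / total) for n in histogram.values() if n)


def information_gain(counts):
    """Information gain of splitting on the attribute summarized by counts."""
    overall = {}
    for histogram in counts.values():
        for cls, n in histogram.items():
            overall[cls] = overall.get(cls, 0) + n
    total = sum(overall.values())
    remainder = sum(
        sum(histogram.values()) / total * entropy(histogram)
        for histogram in counts.values()
    )
    return entropy(overall) - remainder


def best_attribute(conn, table, attributes, label):
    """Pick the attribute with the highest gain, one aggregate query each."""
    return max(
        attributes,
        key=lambda a: information_gain(class_counts(conn, table, a, label)),
    )


if __name__ == "__main__":
    conn = build_demo_db()
    print(best_attribute(conn, "weather", ["outlook", "windy"], "play"))
```

Each candidate attribute costs one pass that the DBMS is free to serve from an index or a sort, which is the kind of existing support the short-term program proposes to exploit.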
Finally, generation of association rules can be performed in many different ways, depending on the amount of main memory available. There has been a growing number of papers on this subject at recent VLDB and SIGMOD conferences. Some of these techniques are finding their way into products such as IBM’s Intelligent Miner.

Research Program: Long Term

Database mining should learn from the general experience of the DBMS field and follow one of the key DBMS paradigms: building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces. Thus, the focus should be on increasing programmer productivity for KDD application development. In fact, there is a growing need for Knowledge and Data Discovery Management Systems (KDDMS), or second-generation database mining systems, to manage KDD applications just as DBMSs successfully manage business applications.

Queries in KDDMS, however, have to be much more general than SQL; similarly, the queried objects have to be far more complex than

