Cover Feature

Constraint-Based, Multidimensional Data Mining

Integrating both constraint-based and multidimensional mining into one framework provides an interactive, exploratory environment for effective and efficient data analysis and mining.

Jiawei Han, Simon Fraser University
Laks V.S. Lakshmanan, Concordia University and Indian Institute of Technology, Bombay
Raymond T. Ng, University of British Columbia

Computer, 0018-9162/99/$10.00 © 1999 IEEE

Although many data-mining methodologies and systems have been developed in recent years, we contend that, by and large, present mining models lack human involvement, particularly in the form of guidance and user control. We believe that data mining is most effective when the computer does what it does best, like searching large databases or counting, and users do what they do best, like specifying the current mining session’s focus. This division of labor is best achieved through constraint-based mining, in which the user provides constraints that guide a search.¹

Mining can also be improved by employing a multidimensional, hierarchical view of the data. Current data warehouse systems have provided a fertile ground for systematic development of this multidimensional mining.² Together, constraint-based and multidimensional techniques can provide a more ad hoc, query-driven process that exploits the semantics of data more effectively than the processes supported by current stand-alone data-mining systems.

AD HOC AND QUERY DRIVEN

An ad hoc and query-driven data-mining system can be more effective because it better fits queries to the user’s intentions. It can make the process of inferring knowledge more efficient by relying on a query optimizer to deliver high-performance, interactive mining that encourages exploratory mining and analysis.

Such a data-mining system incorporates two capabilities, which also distinguish it from a statistical-analysis program or a machine-learning system.³

First, it should offer an ad hoc mining query language, which is a high-level declarative language comparable to the Structured Query Language (SQL) for relational database management systems. Such a declarative mining language lets users express

• the part of the database to be mined (called the minable view¹),
• the type of pattern/rule to be mined, and
• the properties that the patterns should satisfy.

These properties should include not only numerical constraints on statistical properties (like support, confidence, and correlation), but also those based on attribute domains, classes, and aggregates,¹ such as “I.type = ‘snacks’ and avg(I.price) < 100.”

Second, a data-mining system should support efficient processing and optimization of mining queries by providing a sophisticated mining-query optimizer. Such an optimizer exploits the various constraints stated in the user-specified mining query, together with their properties, to generate access plans that guarantee a level of performance commensurate with the constraints in the query.

CONSTRAINTS: ESSENTIALS FOR AD HOC DATA MINING

We divide constraints into five categories:

• Knowledge type constraints specify the type of knowledge to be mined, such as concept description, association, classification, prediction, clustering, or anomaly. This constraint, unlike other constraints, is usually specified at the beginning of a query because different types of knowledge can require different constraints at later stages.
• Data constraints specify the set of data relevant to the mining task. We often specify such constraints in a form similar to that of an SQL query and process them in query processing.
• Dimension/level constraints confine the dimension(s) or level(s) of data to be examined in a database or a data warehouse. Such constraints follow the model of a multidimensional database and demonstrate the spirit of multidimensional mining. Thus, multidimensional mining can be smoothly incorporated within the framework of constraint-based mining.
• Rule constraints specify concrete constraints on the rules to be mined.
• Interestingness constraints specify what ranges of a measure associated with discovered patterns are useful or interesting from a statistical point of view.

The following example illustrates these five constraints at work. Suppose there is a multidimensional sales database with four interrelated relations

• sales (customer_name, item_name, transaction_id),
• lives (customer_name, district, city),
• item (item_name, category, price), and
• transaction (transaction_id, day, month, year),

where lives, item, and transaction are three dimension tables. These tables are linked to the sales table via three keys: customer_name, item_name, and transaction_id.

“Find which cheap items (with the sum of the prices less than $100) may promote the sales of which expensive items (with a minimum price of $500) in the same category for Vancouver customers in 1998” is an association mining query. It is expressed in a data mining query language (DMQL¹) as shown in Figure 1a.

This mining query may allow the generation of association rules like those shown in Figure 1b. The rule means that if a customer in Vancouver bought Census_CD and MS Office 97, there is a 68 percent probability that the customer will also buy MS SQL Server. The rule further indicates that 1.5 percent of all the customers fulfilled all the criteria.

In this query, the knowledge type constraint is association. The data constraint is lives(C, _, “Vancouver”). The dimension constraints involve all three dimensions (lives, item, and transaction) because the query refers to all of them.

The levels are more confined. For lives, we consider only customer_name since city = “Vancouver” is used only in the selection; for item, we consider the levels item_name and category since they are used in the query; and for transaction, we consider only transaction_id since day and month are not referenced and year is used only in the selection. Rule constraints include most portions of the where and having clauses, such as S.year = 1998, T.year = 1998, I.category = J.category, sum(I.price) < $100, and min(J.price) ≥ 500. Finally, there are two interestingness constraints (thresholds), min_support = 0.01 and min_confidence = 0.5.

Knowledge type constraints and data constraints can be applied before data mining. That is, they are not intertwined with the mining process itself. After applying these constraints, a mining process may first mine all of the possible rules before applying the remaining three categories of constraints and
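
The observation that data constraints can be applied before mining, and the meaning of the two interestingness thresholds, can be illustrated with a small Python sketch. This is not the authors' DMQL processor: the toy rows, the customer names, and the functions minable_baskets and rule_stats are hypothetical, introduced only for illustration. The sketch assumes that support and confidence are counted per customer, as the reading of the example rule suggests, and it leaves out the rule constraints on price and category.

```python
# Toy rows are hypothetical; only the relation layouts come from the example.
# lives(customer_name, district, city)
LIVES = [("ann", "d1", "Vancouver"), ("bob", "d2", "Vancouver"),
         ("cat", "d3", "Vancouver"), ("dan", "d4", "Toronto")]
# transaction(transaction_id, day, month, year)
TRANSACTION = [(1, 5, 1, 1998), (2, 6, 1, 1998), (3, 7, 2, 1998),
               (4, 8, 2, 1998), (5, 9, 3, 1997)]
# sales(customer_name, item_name, transaction_id)
SALES = [("ann", "Census_CD", 1), ("ann", "MS Office 97", 1), ("ann", "MS SQL Server", 1),
         ("bob", "Census_CD", 2), ("bob", "MS Office 97", 2),
         ("bob", "MS SQL Server", 5),  # bought in 1997, outside the data constraint
         ("cat", "Census_CD", 3), ("cat", "MS Office 97", 3), ("cat", "MS SQL Server", 3),
         ("dan", "Census_CD", 4), ("dan", "MS Office 97", 4), ("dan", "MS SQL Server", 4)]


def minable_baskets(sales, lives, transaction, city, year):
    """Apply the data constraint before mining: keep only items bought by
    customers living in `city` in transactions dated `year`."""
    customers = {name for name, _district, c in lives if c == city}
    txn_ids = {tid for tid, _day, _month, y in transaction if y == year}
    # Every customer in the city counts, even one with no qualifying purchase.
    baskets = {name: set() for name in customers}
    for name, item, tid in sales:
        if name in customers and tid in txn_ids:
            baskets[name].add(item)
    return baskets


def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) of antecedent => consequent,
    using customers as the counting unit (an assumption of this sketch)."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    with_antecedent = sum(1 for items in baskets.values() if antecedent <= items)
    with_both = sum(1 for items in baskets.values()
                    if (antecedent | consequent) <= items)
    support = with_both / n if n else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Interestingness thresholds quoted in the example query.
MIN_SUPPORT, MIN_CONFIDENCE = 0.01, 0.5

baskets = minable_baskets(SALES, LIVES, TRANSACTION, "Vancouver", 1998)
support, confidence = rule_stats(
    baskets, {"Census_CD", "MS Office 97"}, {"MS SQL Server"})
is_interesting = support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE
print(f"support={support:.3f} confidence={confidence:.3f} "
      f"interesting={is_interesting}")
```

Because the city and year filters run first, the Toronto customer and the 1997 purchase never reach the counting step, which mirrors the point that data constraints are not intertwined with the mining process itself.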