MIT 16 412J - Study Guide

JMLR: Workshop and Conference Proceedings 10: 22-34. The Fourth Workshop on Feature Selection in Data Mining

A Statistical Implicative Analysis Based Algorithm and MMPC Algorithm for Detecting Multiple Dependencies

Elham Salehi ([email protected])
Nyayachavadi ([email protected])
Gras ([email protected])
School of Computer Science, University of Windsor, Windsor, Ontario, N9B 3P4

Editors: Huan Liu, Hiroshi Motoda, Rudy Setiono, and Zheng Zhao

Abstract

Discovering the dependencies among the variables of a domain from examples is an important problem in optimization. Many methods have been proposed for this purpose, but few large-scale evaluations have been conducted. Most of these methods are based on measurements of conditional probability. Statistical implicative analysis offers another perspective on dependencies. It is therefore important to compare the results obtained with this approach against one of the best methods currently available for this task: the MMPC heuristic. As SIA is not directly applicable to this problem, we designed an extension of it for our purpose. We conducted a large number of experiments, varying parameters such as the number of dependencies, the number of variables involved, and the type of their distribution, in order to compare the two approaches. The results show a strong complementarity between the two methods.

Keywords: Statistical Implicative Analysis, multiple dependencies, Bayesian network

1. Introduction

There are many situations in which finding the dependencies among the variables of a domain is needed, and a model describing these dependencies provides significant information. Knowing which variables affect which others can be very useful for the selection of variables, for decomposing a problem into independent sub-problems, for predicting the value of a variable from other variables in a classification problem, or for finding an instantiation of a set of variables that maximizes the value of some function (A. Goldenberg, 2004; Y. Zeng, 2008).

The classical model used for the detection of dependencies is the Bayesian network. This network is a factorization of the probability distribution of a set of examples. It is well known that constructing a Bayesian network from examples is an NP-hard problem, so various heuristic algorithms have been designed to solve it (Neapolitan, 2003; E. Salehi, 2009). Most of these heuristics are greedy and/or try to reduce the exponential search space with a filtering strategy. The filtering is based on measures that aim to discover sets of variables with a high potential to be mutually dependent or independent.

© 2010 Salehi, Nyayachavadi and Gras.

These measures rely on an evaluation of the degree of conditional independence. However, other measures exist that are not based on conditional probability measurements and yet have the ability to discover dependencies. Using such a measure can provide another perspective on the structure of the dependencies among the variables of a domain. Statistical Implicative Analysis (SIA) has already shown a great capability for extracting quasi-implications, also called association rules (R. Gras, 2008). We present a measure for multiple dependencies based on SIA and then use this measure in a greedy algorithm to solve the problem of detecting multiple dependencies. We compared our new algorithm with one of the most successful heuristics based on conditional dependencies introduced so far, MMPC (I. Tsamardinos, 2006). We designed a set of experiments to evaluate the capacity of each of them to discover two kinds of knowledge: the fact that one variable conditionally depends on another, and the sets of variables that are involved in a conditional dependency relation.
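As background for the SIA-based measure developed later in the paper, the classical SIA implication intensity of a single quasi-implication a → b can be sketched as follows. This is a hedged illustration of the standard formulation (counterexamples modelled by a Poisson distribution under independence), not the paper's multi-variable extension; the function and variable names are our own.

```python
import math

def implication_intensity(n, n_a, n_not_b, n_counter):
    """Implication intensity of the quasi-implication a -> b.

    n         : total number of examples
    n_a       : examples satisfying a
    n_not_b   : examples not satisfying b
    n_counter : counterexamples, i.e. examples with a and not b

    If a and not-b were independent, the number of counterexamples
    would be approximately Poisson with mean n_a * n_not_b / n.
    The intensity is 1 - P(Q <= n_counter): values close to 1 mean
    that a -> b admits far fewer counterexamples than chance alone
    would produce, i.e. a strong quasi-implication.
    """
    lam = n_a * n_not_b / n
    cdf = sum(math.exp(-lam) * lam ** k / math.factorial(k)
              for k in range(n_counter + 1))
    return 1.0 - cdf
```

For instance, with 1000 examples, 300 occurrences of a, 200 occurrences of not-b, and only 10 counterexamples, the expected count under independence is 60, so the intensity is close to 1; with 60 counterexamples it drops to roughly one half.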
Both kinds of information can be used to decompose the NP-hard problem of finding the structure of a Bayesian network into independent sub-problems, and can therefore considerably reduce the size of the corresponding search space.

This paper is organized as follows. In the next section we describe the MMPC heuristic; in Section 3 we present our SIA-based measure and algorithm for finding multiple dependencies; the experimental results of both algorithms are presented in Section 4; finally, we conclude in Section 5 with a brief discussion.

2. The MMPC Heuristic

Discovering multiple dependencies from a set of examples is a difficult problem. It is clear that this problem cannot be solved exactly once the number of variables reaches a few dozen. However, for some problems the number of variables can be several hundred or several thousand. It is therefore particularly important to have methods that obtain an approximate solution of good quality. A local search approach is usually used for these problems. In this case the model of dependencies is built incrementally by adding or removing one or more dependencies at each step. The dependencies to add or remove are chosen using a score that assesses the quality of the new model with respect to the set of examples (E. Salehi, 2009). In this approach the search space is exponential in the maximum number of variables on which a variable may depend. There is therefore a need for methods that increase the chances of building a good-quality model without exploring the whole search space exhaustively. One possible approach is to use a less computationally expensive method to determine a promising subset of the search space, to which a more systematic and costly method can subsequently be applied.

The final model is usually a Bayesian network in which the dependencies represent conditional independencies among variables. It is possible to build this model using information from other measures besides conditional probability.
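The score-driven local search described above can be sketched as a simple greedy hill-climber. This is a generic illustration under our own assumptions (a caller-supplied decomposable score and pre-filtered candidate parent sets), not the MMPC algorithm itself or the paper's implementation; all names are hypothetical.

```python
def greedy_structure_search(nodes, candidate_parents, score):
    """Greedy local search over directed acyclic graphs.

    nodes             : list of variable names
    candidate_parents : dict mapping each node to the set of parents
                        retained by a cheap first-phase filter
    score             : score(node, parent_set) -> float, higher is
                        better; assumed decomposable per node

    Starting from the empty graph, repeatedly apply the single edge
    addition or deletion that most improves the score, stopping at
    a local optimum.
    """
    parents = {v: set() for v in nodes}

    def ancestors(node):
        # All nodes with a directed path into `node`.
        seen, stack = set(), list(parents[node])
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(parents[u])
        return seen

    while True:
        best_gain, best_move = 1e-12, None
        for v in nodes:
            base = score(v, parents[v])
            for p in candidate_parents[v] - parents[v]:
                # Adding p -> v must not close a directed cycle.
                if p == v or v in ancestors(p):
                    continue
                gain = score(v, parents[v] | {p}) - base
                if gain > best_gain:
                    best_gain, best_move = gain, ("add", p, v)
            for p in set(parents[v]):
                gain = score(v, parents[v] - {p}) - base
                if gain > best_gain:
                    best_gain, best_move = gain, ("del", p, v)
        if best_move is None:
            return parents
        op, p, v = best_move
        if op == "add":
            parents[v].add(p)
        else:
            parents[v].discard(p)
```

With a toy score that simply rewards matching a known target structure, the search recovers it exactly; in practice the score would be a penalized likelihood (such as BIC) computed from the examples, and the candidate parent sets would come from the filtering phase.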
Indeed, the measurements computed in the first phase are used as a filter to eliminate independent variables, or to gather variables with shared dependencies into several sub-groups. The second phase uses this filtered information to build a Bayesian network. The goal of our study is to compare the ability of the two approaches to detect dependencies in the first phase. In this section a measure based on conditional probability is described; in Section 4 this measure will be compared with an SIA-based measure.

2.1 Definition and Notation

A Bayesian network is a tool to

