
K-State CIS 830 - Lecture 30


Lecture 30: Relevance Determination in KDD
Data Mining and KDD Presentation (2 of 4)
Kansas State University, Department of Computing and Information Sciences
CIS 830: Advanced Topics in Artificial Intelligence
Monday, April 3, 2000
Presenter: DingBing Yang, Department of Plant Pathology, KSU
Read: "Irrelevant Features and the Subset Selection Problem", George H. John, Ron Kohavi, and Karl Pfleger

Presentation Outline

• Objective
  – Finding a subset of features that allows a supervised induction algorithm to induce small, high-accuracy concepts.
• Overview
  – Introduction
  – Relevance definition
  – The filter model and the wrapper model
  – Experimental results
• References
  – Avrim L. Blum and Pat Langley, "Selection of Relevant Features and Examples in Machine Learning", Artificial Intelligence 97 (1997) 245-271
  – Ron Kohavi and George H. John, "Wrappers for Feature Subset Selection", Artificial Intelligence 97 (1997) 273-324

• Why find a good feature subset?
  – Some learning algorithms degrade in performance (prediction accuracy) when faced with many features that are not necessary for predicting the desired output.
    E.g., decision tree algorithms (ID3, C4.5, CART) and instance-based algorithms (IBL).
  – Some algorithms (e.g., Naïve Bayes) are robust with respect to irrelevant features, but their performance may degrade quickly if correlated features are added, even when those features are relevant.
• An example
  – Running C4.5 on the Monk1 dataset, which contains 3 irrelevant features, the induced tree has 15 interior nodes, five of which test irrelevant features; the generated tree has an error rate of 24.3%.
  – If only the relevant features are given, the error rate is reduced to 11.1%.
• What is an optimal feature subset?
  – Given an inducer I and a dataset D with features X1, X2, ..., Xn, drawn from a distribution D over the labeled instance space, an optimal feature subset is a subset of the features such that the accuracy of the induced classifier C = I(D) is maximal.

Incorrect Induced Decision Tree

[Figure: the tree induced by C4.5 for the "Corral" dataset, which has correlated features and irrelevant features.]

• ID3 algorithm
  – A decision tree learning algorithm; it constructs the decision tree top-down.
  – Compute the information gain of each candidate attribute, and select the attribute with the maximum IG value as the test at the root node of the tree.
  – The entire process is then repeated using the training examples associated with each descendant node.
• C4.5 algorithm
  – An improvement over ID3 that uses rule post-pruning:
  – Infer the decision tree from the training set.
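The information-gain computation at the heart of ID3's attribute choice can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy dataset, not code from the lecture or the papers:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_c p_c * log2(p_c) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG(attr) = H(Y) - sum_v (|D_v| / |D|) * H(Y restricted to attr = v)."""
    n = len(labels)
    by_value = {}
    for x, y in zip(rows, labels):
        by_value.setdefault(x[attr], []).append(y)
    return entropy(labels) - sum(
        len(sub) / n * entropy(sub) for sub in by_value.values()
    )

# Hypothetical toy data: attribute 0 determines the label, attribute 1 is noise.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
print(information_gain(rows, labels, 0))  # 1.0 (perfect split)
print(information_gain(rows, labels, 1))  # 0.0 (irrelevant attribute)
```

ID3 would pick attribute 0 here; note that the zero gain for attribute 1 is exactly the behavior that lets irrelevant features slip into deeper nodes once the data at a node is sparse.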
  – Convert the learned tree into an equivalent set of rules.
  – Prune each rule by removing any precondition whose removal improves the rule's estimated accuracy.

Background Knowledge

• K-nearest neighbor learning
  – An instance-based learning method: it simply stores the training examples, and generalization beyond these examples is postponed until a new instance must be classified.
  – Each time a new query instance is encountered, its relation to the previously stored examples is examined.
  – The target function value for a new query is estimated from the known values of the k nearest training examples.
• Minimum Description Length (MDL) principle
  – Choose the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis.
• Naïve Bayes classifier
  – Incorporates the simplifying assumption that attribute values are conditionally independent, given the classification of the instance.

Relevance Definition

• Assumptions
  – A set of n training instances; each training instance is a tuple <X, Y>, where X is an element of the set F1 × F2 × ... × Fm, Fi is the domain of the ith feature, and Y is the label.
  – Given an instance, the value of feature Xi is denoted by xi.
  – Assume a probability measure p on the space F1 × F2 × ... × Fm × Y.
  – Si is the set of all features except Xi: Si = {X1, ..., Xi-1, Xi+1, ..., Xm}.
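The Naïve Bayes decision rule described under Background Knowledge can be sketched as follows. This is a toy illustration on hypothetical data, with plain frequency estimates and no smoothing, for clarity:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(y) and P(x_i | y) by counting (no smoothing, for clarity)."""
    label_counts = Counter(labels)
    cond = defaultdict(Counter)          # (feature index, y) -> value counts
    for x, y in zip(rows, labels):
        for i, v in enumerate(x):
            cond[(i, y)][v] += 1
    return label_counts, cond

def predict_nb(x, label_counts, cond):
    """argmax_y P(y) * prod_i P(x_i | y): the conditional-independence rule."""
    n = sum(label_counts.values())
    best, best_score = None, -1.0
    for y, c in label_counts.items():
        score = c / n
        for i, v in enumerate(x):
            score *= cond[(i, y)][v] / c
        if score > best_score:
            best, best_score = y, score
    return best

# Hypothetical toy dataset: two Boolean features, label equals the first one.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
lc, cond = train_nb(rows, labels)
print(predict_nb((1, 0), lc, cond))   # 1
```

The product over features is exactly where the independence assumption enters; adding a feature correlated with an existing one multiplies in near-duplicate evidence, which is why Naïve Bayes degrades with correlated features even when they are relevant.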
• Strong relevance
  – Xi is strongly relevant iff there exist some xi, y, and si for which p(Xi = xi, Si = si) > 0 such that p(Y = y | Si = si, Xi = xi) ≠ p(Y = y | Si = si).
  – Intuitive understanding: a strongly relevant feature cannot be removed without loss of prediction accuracy.
• Weak relevance
  – A feature Xi is weakly relevant iff it is not strongly relevant, and there exists a subset of features Si' of Si for which there exist some xi, y, and si' for which p(Xi = xi, Si' = si') > 0 such that p(Y = y | Si' = si', Xi = xi) ≠ p(Y = y | Si' = si').
  – Intuitive understanding: a weakly relevant feature can sometimes contribute to prediction accuracy.
• Irrelevance
  – Features are irrelevant if they are neither strongly nor weakly relevant.
  – Intuitive understanding: irrelevant features can never contribute to prediction accuracy.
• Example
  – Let features X1, ..., X5 be Boolean, with X2 = ¬X4 and X3 = ¬X5. There are only eight possible instances, and we assume they are equiprobable. The target is Y = X1 + X2.
  – X1: strongly relevant; X2, X4: weakly relevant; X3, X5: irrelevant.
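The example can be checked mechanically by brute force over the eight instances. The sketch below is an illustration, not code from the lecture; it reads the "+" in Y = X1 + X2 as Boolean OR (the relevance classification comes out the same if it is read as XOR):

```python
from itertools import product, combinations

# The slide's example: Boolean X1..X5 with X2 = ¬X4 and X3 = ¬X5, giving
# eight equiprobable instances, and target Y = X1 + X2 (read here as OR).
DATA = []
for x1, x2, x3 in product([0, 1], repeat=3):
    x = (x1, x2, x3, 1 - x2, 1 - x3)     # X4 = ¬X2, X5 = ¬X3
    DATA.append((x, x1 | x2))

def p_y_given(cond):
    """P(Y = 1 | the feature assignments in cond), instances equiprobable."""
    match = [y for x, y in DATA if all(x[i] == v for i, v in cond.items())]
    return sum(match) / len(match)

def changes_prediction(i, subset):
    """Does additionally conditioning on X_i ever change P(Y | subset)?"""
    for x, _ in DATA:                     # each cond comes from a real
        cond = {j: x[j] for j in subset}  # instance, so p(cond) > 0 holds
        if p_y_given({**cond, i: x[i]}) != p_y_given(cond):
            return True
    return False

def relevance(i):
    others = [j for j in range(5) if j != i]
    if changes_prediction(i, others):     # strong: all other features fixed
        return "strong"
    for k in range(len(others) + 1):      # weak: some subset of the others
        for sub in combinations(others, k):
            if changes_prediction(i, list(sub)):
                return "weak"
    return "irrelevant"

print([relevance(i) for i in range(5)])
# ['strong', 'weak', 'irrelevant', 'weak', 'irrelevant']
```

X2 fails the strong test because conditioning on S2 includes X4 = ¬X2, which already pins down X2's value; it passes the weak test on subsets that omit X4, matching the definitions above.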

