SBU CSE 634 - SLIQ - fast scalable classifier


Decision Trees: SLIQ – a fast, scalable classifier
Group 12: Vaibhav Chopda, Tarun Bahadur
Paper by Manish Mehta, Rakesh Agrawal, and Jorma Rissanen
Source: http://citeseer.ifi.unizh.ch/mehta96sliq.html
Material includes lecture notes for CSE634 – Prof. Anita Wasilewska (http://www.cs.sunysb.edu/~cse634); the classification and decision-tree background sections below are adapted from those course notes.

Slide outline: Agenda • Classification Process: Model Construction • Testing and Prediction (by a Classifier) • Classification by Decision Tree Induction • Basic Idea of the ID3/C4.5 Algorithm • Example from Prof. Anita's Slides • Shortcomings of ID3 • SLIQ – a Decision Tree Classifier • SLIQ Methodology • Example • Attribute Listing Phase • Presorting Phase • Constructing the Decision Tree • Gini Index: The Preferred Splitting Index • Numeric Attribute Splitting Index • Splitting for Categorical Attributes • Determining the Subset with the Highest Index • The Decision Tree Getting Constructed (Levels 0 and 1) • The Classification at Level 1 • Performance: Classification Accuracy, Decision Tree Size, Execution Time, Scalability • Conclusion

Agenda
• What is classification? Why decision trees?
• The ID3 algorithm
• Limitations of the ID3 algorithm
• SLIQ – a fast, scalable classifier for data mining
• SPRINT – the successor of SLIQ

Classification Process: Model Construction
Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm builds a classifier (model) from the training data; here the learned model is the rule:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Testing and Prediction (by a Classifier)
Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) – Tenured?

Classification by Decision Tree Induction
• A decision tree (tuples flow along the tree structure):
– an internal node denotes an attribute;
– a branch represents a value of the node's attribute;
– leaf nodes represent class labels or class distributions.
• Decision tree generation consists of two phases:
– Tree construction: at the start we choose one attribute as the root and put all of its values as branches; we then recursively choose internal nodes (attributes) with their proper values as branches. We stop when all the samples (records) at a node are of the same class (the node becomes a leaf labeled with that class), when there are no samples left, or when there are no new attributes left to place as nodes (in this case we apply MAJORITY VOTING to classify the node).
– Tree pruning: identify and remove branches that reflect noise or outliers.
• Where's the challenge? A good choice of the root attribute, and of the internal-node attributes, is the crucial point. Decision tree induction algorithms differ in their methods for evaluating and choosing the root and internal-node attributes.

Basic Idea of the ID3/C4.5 Algorithm
• A greedy algorithm: it constructs decision trees in a top-down, recursive, divide-and-conquer manner.
• The tree STARTS as a single node (the root) representing the whole training dataset (all samples).
• IF the samples are ALL in the same class, THEN the node becomes a LEAF labeled with that class.
• OTHERWISE the algorithm uses an entropy-based measure, information gain, as the heuristic for selecting the ATTRIBUTE that BEST separates the samples into individual classes; this attribute becomes the node's test (tree-split decision) attribute. (A small sketch of this heuristic follows this section.)
• A branch is created for each value of the node attribute and is labeled by that value (this is syntax), and the samples are partitioned accordingly (this is semantics; see the example which follows).
• The algorithm uses the same process recursively to form a decision tree for each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants.
• The recursive partitioning STOPS when any one of the following conditions is true:
– all records (samples) at the node belong to the same class;
– there are no remaining attributes on which the records (samples) may be further partitioned; in this case we convert the node into a LEAF and label it with the class in the majority among its samples (majority voting);
– there are no records (samples) left; a leaf is created and labeled by majority vote among the parent's training samples.
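To make the information-gain heuristic concrete, here is a minimal sketch in Python (our illustration, not code from the paper or the course notes; the record layout and field names follow the tenure example above):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(records, attr, class_key="TENURED"):
    """Entropy of the whole sample minus the weighted entropy of the
    partitions induced by splitting on `attr`."""
    partitions = {}
    for rec in records:
        partitions.setdefault(rec[attr], []).append(rec[class_key])
    remainder = sum(len(p) / len(records) * entropy(p)
                    for p in partitions.values())
    return entropy([rec[class_key] for rec in records]) - remainder

# ID3 chooses, at every node, the attribute with the highest gain.
records = [
    {"RANK": "Assistant Prof", "YEARS": 3, "TENURED": "no"},
    {"RANK": "Assistant Prof", "YEARS": 7, "TENURED": "yes"},
    {"RANK": "Professor",      "YEARS": 2, "TENURED": "yes"},
    {"RANK": "Associate Prof", "YEARS": 7, "TENURED": "yes"},
    {"RANK": "Assistant Prof", "YEARS": 6, "TENURED": "no"},
    {"RANK": "Associate Prof", "YEARS": 3, "TENURED": "no"},
]
print(max(["RANK", "YEARS"], key=lambda a: information_gain(records, a)))
```

Note that plain ID3 treats the numeric YEARS attribute as categorical, branching on every distinct value, which gives it a spuriously perfect gain here; this is one face of the range-attribute weakness listed among the shortcomings of ID3 below.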
Example from Prof. Anita's Slides
This follows an example from Quinlan's ID3.

Shortcomings of ID3
• Scalability: it requires a lot of computation at every stage of decision tree construction.
• Scalability: it needs all of the training data to fit in memory.
• It does not suggest a standard splitting index for range (numeric) attributes.

SLIQ – a Decision Tree Classifier
Features of SLIQ:
• Applies to both numeric and categorical attributes.
• Builds compact and accurate trees.
• Uses a presorting technique in the tree-growing phase, together with an inexpensive pruning algorithm.
• Suitable for classifying large disk-resident datasets, independently of the number of classes, attributes, and records.

SLIQ Methodology
Start → generate an attribute list for each attribute → sort the attribute lists of NUMERIC attributes → create the decision tree by partitioning records → End. (A sketch of the two preprocessing steps follows the presorted lists below.)

Example
Training data:

Drivers Age  CarType  Class
23           Family   HIGH
17           Sports   HIGH
43           Sports   HIGH
68           Family   LOW
32           Truck    LOW
20           Family   HIGH

Attribute Listing Phase
Each training record is identified by its position in the table (RecId 0–5):

RecId  Age  CarType  Class
0      23   Family   HIGH
1      17   Sports   HIGH
2      43   Sports   HIGH
3      68   Family   LOW
4      32   Truck    LOW
5      20   Family   HIGH

SLIQ splits this table into one list per attribute, each entry carrying the attribute value, the class label, and the record id.

Age – NUMERIC attribute:

Age  Class  RecId
23   HIGH   0
17   HIGH   1
43   HIGH   2
68   LOW    3
32   LOW    4
20   HIGH   5

CarType – CATEGORICAL attribute:

CarType  Class  RecId
Family   HIGH   0
Sports   HIGH   1
Sports   HIGH   2
Family   LOW    3
Truck    LOW    4
Family   HIGH   5

Presorting Phase
The lists of numeric attributes are sorted by attribute value once, up front; categorical lists stay in record order:

Age  Class  RecId
17   HIGH   1
20   HIGH   5
23   HIGH   0
32   LOW    4
43   HIGH   2
68   LOW    3

CarType  Class  RecId
Family   HIGH   0
Sports   HIGH   1
Sports   HIGH   2
Family   LOW    3
Truck    LOW    4
Family   HIGH   5
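The listing and presorting steps reduce to a few lines. Here is a minimal sketch (our illustration; the variable names are ours, not the paper's) that reproduces the two Age tables above:

```python
# SLIQ preprocessing on the driver data: build one attribute list per
# attribute, then presort the lists of numeric attributes by value.
training = [            # RecId is the position in this list
    (23, "Family", "HIGH"),
    (17, "Sports", "HIGH"),
    (43, "Sports", "HIGH"),
    (68, "Family", "LOW"),
    (32, "Truck",  "LOW"),
    (20, "Family", "HIGH"),
]

# Attribute listing phase: one (value, class, RecId) entry per record.
age_list     = [(age, cls, rid) for rid, (age, car, cls) in enumerate(training)]
cartype_list = [(car, cls, rid) for rid, (age, car, cls) in enumerate(training)]

# Presorting phase: numeric lists are sorted once, so the tree-growing
# phase never re-sorts; the categorical list stays in record order.
age_list.sort(key=lambda entry: entry[0])

print(age_list)
# [(17, 'HIGH', 1), (20, 'HIGH', 5), (23, 'HIGH', 0),
#  (32, 'LOW', 4), (43, 'HIGH', 2), (68, 'LOW', 3)]
```

The point of presorting is that the expensive sort happens exactly once per numeric attribute; subsequent split evaluations scan the already-sorted lists, which is what makes SLIQ suitable for large, disk-resident datasets.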

