Contents (slide titles of the full deck): Decision Tree Classification · Papers · Outline · Classification Problem · A Training Set · Classification Models · Why Decision Tree Model? · A Decision Tree · Slide 9 · Tree Building Phase · Tree Building Phase (cont.) · Splitting Index · The Best Split · Tree Pruning Phase · SLIQ - Overview · Data Structure · An Illustration of the Data Structure · Pre-sorting · After Pre-sorting · Node Split · Class Histogram · Evaluate Split · Subsetting for Categorical Attributes · Partition the Data · Example of Evaluating Splits · Example of Updating Class List · MDL Principle · MDL Pruning Algorithm · Encoding Scheme · Performance (Scalability) · SPRINT - Overview · Data Structure – Attribute List · An Example of Attribute Lists · Attribute Lists after Splitting · Data Structure - Histogram · Finding Split Points · Evaluate Numeric Attributes · Evaluate Categorical Attributes · Performing the Split · Performing the Split (cont.) · Slide 41 · Parallelizing Classification · Parallel Data Placement · Slide 44 · Example of Histograms in Parallel Classification · Performing the Splits · SLIQ vs. SPRINT · Data Streams · Issues · Incremental Learning Methods · Hoeffding Tree Algorithm · Hoeffding Bound · Hoeffding Bound (cont.) · Slide 54 · VFDT (Very Fast Decision Tree learner) · Performance – Examples · Performance – Nodes · Performance – Noise Data · Conclusion

Slide 1: Decision Tree Classification
Tomi Yiu
CS 632 — Advanced Database Systems
April 5, 2001

Slide 2: Papers
- Manish Mehta, Rakesh Agrawal, Jorma Rissanen: SLIQ: A Fast Scalable Classifier for Data Mining.
- John C. Shafer, Rakesh Agrawal, Manish Mehta: SPRINT: A Scalable Parallel Classifier for Data Mining.
- Pedro Domingos, Geoff Hulten: Mining High-Speed Data Streams.

Slide 3: Outline
- Classification problem
- General decision tree model
- Decision tree classifiers:
  - SLIQ
  - SPRINT
  - VFDT (Hoeffding Tree Algorithm)

Slide 4: Classification Problem
- Given a set of example records, where each record consists of:
  - a set of attributes
  - a class label
- Build an accurate model for each class based on the set of attributes.
- Use the model to classify future data for which the class labels are unknown.

Slide 5: A Training Set

  Age   Car Type   Risk
  23    Family     High
  17    Sports     High
  43    Sports     High
  68    Family     Low
  32    Truck      Low
  20    Family     High

Slide 6: Classification Models
- Neural networks
- Statistical models (linear/quadratic discriminants)
- Decision trees
- Genetic models

Slide 7: Why the Decision Tree Model?
- Relatively fast compared to other classification models
- Achieves accuracy similar to, and sometimes better than, other models
- Simple and easy to understand
- Can be converted into simple, easy-to-understand classification rules

Slide 8: A Decision Tree

  Age < 25?
   |- yes: High
   |- no:  Car Type in {Sports}?
            |- yes: High
            |- no:  Low

Slide 9: Decision Tree Classification
A decision tree is created in two phases:
- Tree-building phase: repeatedly partition the training data until all the examples in each partition belong to one class, or the partition is sufficiently small.
- Tree-pruning phase: remove dependency on statistical noise or variation that may be particular only to the training set.

Slide 10: Tree Building Phase
General tree-growth algorithm (binary tree):

  Partition(Data S)
    if (all points in S are of the same class) then return;
    for each attribute A do
      evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);

Slide 11: Tree Building Phase (cont.)
The form of the split depends on the type of the attribute:
- Splits for numeric attributes are of the form A ≤ v, where v is a real number.
- Splits for categorical attributes are of the form A ∈ S', where S' is a subset of the possible values of A.

Slide 12: Splitting Index
- Alternative splits for an attribute are compared using a splitting index.
- Examples of splitting indices:
  - Entropy: entropy(T) = - Σj pj log2(pj)
  - Gini index: gini(T) = 1 - Σj pj²
  (pj is the relative frequency of class j in T)
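To make the tree-growth loop of slide 10 concrete, here is a minimal Python sketch that uses the gini index from slide 12 and picks the split with the largest gini reduction (the criterion formalized on slide 13 below). It is purely illustrative, not the SLIQ/SPRINT code: the names grow, best_split, and Node are invented here, and Car Type is coded as an integer instead of being split on subsets.

from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None    # set only on leaves
    attr: Optional[int] = None     # index of the attribute tested at this node
    value: Optional[float] = None  # records with record[attr] <= value go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels, attr):
    # Score every candidate test (attr <= v) by its reduction in gini index.
    parent, n = gini(labels), len(labels)
    best_gain, best_v = 0.0, None
    for v in sorted({r[attr] for r in records}):
        left = [y for r, y in zip(records, labels) if r[attr] <= v]
        right = [y for r, y in zip(records, labels) if r[attr] > v]
        if not left or not right:
            continue
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_gain, best_v = gain, v
    return best_gain, best_v

def grow(records, labels, min_size=1):
    # Stop when the partition is pure or sufficiently small.
    if len(set(labels)) == 1 or len(labels) <= min_size:
        return Node(label=Counter(labels).most_common(1)[0][0])
    # Choose the attribute whose best split gives the largest gini reduction.
    scored = [(best_split(records, labels, a), a) for a in range(len(records[0]))]
    (gain, v), attr = max(scored, key=lambda s: s[0][0])
    if v is None:                     # no split improves on the parent node
        return Node(label=Counter(labels).most_common(1)[0][0])
    left = [(r, y) for r, y in zip(records, labels) if r[attr] <= v]
    right = [(r, y) for r, y in zip(records, labels) if r[attr] > v]
    return Node(attr=attr, value=v,
                left=grow([r for r, _ in left], [y for _, y in left], min_size),
                right=grow([r for r, _ in right], [y for _, y in right], min_size))

# The training set from slide 5; Car Type is coded as an integer for brevity
# (0 = Family, 1 = Sports, 2 = Truck) instead of using subset splits.
records = [(23, 0), (17, 1), (43, 1), (68, 0), (32, 2), (20, 0)]
labels = ["High", "High", "High", "Low", "Low", "High"]
root = grow(records, labels)
print(root.attr, root.value)          # 0 23 -> the root test is "Age <= 23"

Because of the integer coding of Car Type, the resulting tree is only close to the one on slide 8; the point is the structure of the recursive partitioning loop, not the exact splits.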
Slide 13: The Best Split
- Suppose the splitting index is I(), and a split partitions S into S1 and S2.
- The best split is the one that maximizes:

    I(S) - [ |S1|/|S| · I(S1) + |S2|/|S| · I(S2) ]

Slide 14: Tree Pruning Phase
- Examine the initial tree built.
- Choose the subtree with the least estimated error rate.
- Two approaches for error estimation:
  - use the original training dataset (e.g. cross-validation)
  - use an independent dataset

Slide 15: SLIQ - Overview
- Capable of classifying disk-resident datasets
- Scalable for large datasets
- Uses a pre-sorting technique to reduce the cost of evaluating numeric attributes
- Uses a breadth-first tree-growing strategy
- Uses an inexpensive tree-pruning algorithm based on the Minimum Description Length (MDL) principle

Slide 16: Data Structure
- A list (the class list) for the class labels:
  - each entry has two fields: the class label and a reference to a leaf node of the decision tree
  - memory-resident
- A list for each attribute:
  - each entry has two fields: the attribute value and an index into the class list
  - written to disk if necessary

Slide 17: An Illustration of the Data Structure

  Age list           Car Type list         Class list
  Age   Index        Car Type   Index      #   Class   Leaf
  23    1            Family     1          1   High    N1
  17    2            Sports     2          2   High    N1
  43    3            Sports     3          3   High    N1
  68    4            Family     4          4   Low     N1
  32    5            Truck      5          5   Low     N1
  20    6            Family     6          6   High    N1

  (Index = index into the class list; Leaf = the leaf node the record currently belongs to.)

Slide 18: Pre-sorting
- Sorting of the data is required to find splits for numeric attributes.
- Previous algorithms sort the data at every node in the tree.
- Using the separate-list data structure, SLIQ sorts the data only once, at the beginning of the tree-building phase.

Slide 19: After Pre-sorting

  Age list           Car Type list         Class list
  Age   Index        Car Type   Index      #   Class   Leaf
  17    2            Family     1          1   High    N1
  20    6            Sports     2          2   High    N1
  23    1            Sports     3          3   High    N1
  32    5            Family     4          4   Low     N1
  43    3            Truck      5          5   Low     N1
  68    4            Family     6          6   High    N1

Slide 20: Node Split
- SLIQ uses a breadth-first tree-growing strategy.
- In one pass over the data, splits for all the leaves of the current tree can be evaluated.
- SLIQ uses the gini splitting index to evaluate splits.
- This requires the frequency distribution of class values in the data partitions.

Slide 21: Class Histogram
- A class histogram keeps the frequency distribution of class values for each attribute in each leaf node.
- For numeric attributes, the class histogram is a list of <class, frequency> pairs.
- For categorical attributes, the class histogram is a list of <attribute value, class, frequency> triples.

Slide 22: Evaluate Split

  for each attribute A
    traverse the attribute list of A
    for each value v in the attribute list
      find the corresponding class and leaf node l
      update the class histogram in leaf l
      if A is a numeric attribute then
        compute the splitting index for the test (A ≤ v) for leaf l
    if A is a categorical attribute then
      for each leaf of the tree do
        find the subset of A with the best split

  (A small illustrative sketch of this pass appears below.)

Slide 23: Subsetting for Categorical Attributes
If …
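As a concrete (and heavily simplified) illustration of slides 16-22, the following Python sketch builds the class list and a pre-sorted attribute list for Age, then makes one pass over the attribute list while maintaining per-leaf "below"/"above" class histograms to score candidate splits A ≤ v with the gini index. It is an assumption-laden toy, not the SLIQ implementation: the identifiers (class_list, age_list, evaluate_numeric) are invented here, everything is kept in memory, and only a single leaf (the root, N1) exists.

from collections import Counter

# Class list: one entry per record, holding its class label and the id of the
# leaf it currently falls into (all records start at the root leaf N1).
class_list = [
    {"class": "High", "leaf": "N1"},   # record 0  (Age 23, Family)
    {"class": "High", "leaf": "N1"},   # record 1  (Age 17, Sports)
    {"class": "High", "leaf": "N1"},   # record 2  (Age 43, Sports)
    {"class": "Low",  "leaf": "N1"},   # record 3  (Age 68, Family)
    {"class": "Low",  "leaf": "N1"},   # record 4  (Age 32, Truck)
    {"class": "High", "leaf": "N1"},   # record 5  (Age 20, Family)
]

# Attribute list for Age: (value, class-list index), pre-sorted once by value.
age_list = sorted([(23, 0), (17, 1), (43, 2), (68, 3), (32, 4), (20, 5)])

def gini(hist):
    # gini(T) = 1 - sum_j p_j^2, computed from a class-frequency histogram.
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def evaluate_numeric(attr_list, class_list):
    # One pass over a pre-sorted numeric attribute list: move each record from
    # the per-leaf "above" histogram to the "below" histogram and score the
    # candidate split (A <= v) at every position.
    above, below = {}, {}
    for entry in class_list:
        above.setdefault(entry["leaf"], Counter())[entry["class"]] += 1
        below.setdefault(entry["leaf"], Counter())
    best = {}                                   # leaf id -> (gini reduction, v)
    for value, idx in attr_list:
        leaf, cls = class_list[idx]["leaf"], class_list[idx]["class"]
        below[leaf][cls] += 1
        above[leaf][cls] -= 1
        n_lo, n_hi = sum(below[leaf].values()), sum(above[leaf].values())
        if n_hi == 0:                            # nothing above: not a split
            continue
        n = n_lo + n_hi
        gain = (gini(below[leaf] + above[leaf])
                - (n_lo / n) * gini(below[leaf])
                - (n_hi / n) * gini(above[leaf]))
        if gain > best.get(leaf, (0.0, None))[0]:
            best[leaf] = (gain, value)
    return best

print(evaluate_numeric(age_list, class_list))    # {'N1': (0.2222..., 23)}

Running it reports, for each leaf, the best gini reduction and the split value that achieves it; on the slide-5 data the root would be split on Age ≤ 23.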