Contents (slide titles of the full deck): Decision Tree Classification · Papers · Outline · Classification Problem · A Training Set · Classification Models · Why Decision Tree Model? · A Decision Tree · Slide 9 · Tree Building Phase · Tree Building Phase (cont.) · Splitting Index · The Best Split · Tree Pruning Phase · SLIQ - Overview · Data Structure · An Illustration of the Data Structure · Pre-sorting · After Pre-sorting · Node Split · Class Histogram · Evaluate Split · Subsetting for Categorical Attributes · Partition the Data · Example of Evaluating Splits · Example of Updating Class List · MDL Principle · MDL Pruning Algorithm · Encoding Scheme · Performance (Scalability) · SPRINT - Overview · Data Structure – Attribute List · An Example of Attribute Lists · Attribute Lists after Splitting · Data Structure - Histogram · Finding Split Points · Evaluate Numeric Attributes · Evaluate Categorical Attributes · Performing the Split · Performing the Split (cont.) · Slide 41 · Parallelizing Classification · Parallel Data Placement · Slide 44 · Example of Histograms in Parallel Classification · Performing the Splits · SLIQ vs. SPRINT · Data Streams · Issues · Incremental Learning Methods · Hoeffding Tree Algorithm · Hoeffding Bound · Hoeffding Bound (cont.) · Slide 54 · VFDT (Very Fast Decision Tree learner) · Performance – Examples · Performance – Nodes · Performance – Noise Data · Conclusion

Slide 1: Decision Tree Classification
Tomi Yiu
CS 632 — Advanced Database Systems
April 5, 2001

Slide 2: Papers
- Manish Mehta, Rakesh Agrawal, Jorma Rissanen: SLIQ: A Fast Scalable Classifier for Data Mining.
- John C. Shafer, Rakesh Agrawal, Manish Mehta: SPRINT: A Scalable Parallel Classifier for Data Mining.
- Pedro Domingos, Geoff Hulten: Mining High-Speed Data Streams.

Slide 3: Outline
- Classification problem
- General decision tree model
- Decision tree classifiers:
  - SLIQ
  - SPRINT
  - VFDT (Hoeffding Tree Algorithm)

Slide 4: Classification Problem
- Given a set of example records, where each record consists of:
  - a set of attributes
  - a class label
- Build an accurate model for each class based on the set of attributes.
- Use the model to classify future data for which the class labels are unknown.

Slide 5: A Training Set

  Age   Car Type   Risk
  23    Family     High
  17    Sports     High
  43    Sports     High
  68    Family     Low
  32    Truck      Low
  20    Family     High

Slide 6: Classification Models
- Neural networks
- Statistical models (linear/quadratic discriminants)
- Decision trees
- Genetic models

Slide 7: Why the Decision Tree Model?
- Relatively fast compared to other classification models
- Achieves accuracy similar to, and sometimes better than, other models
- Simple and easy to understand
- Can be converted into simple, easy-to-understand classification rules

Slide 8: A Decision Tree

  Age < 25?
   |- yes: High
   |- no:  Car Type in {Sports}?
            |- yes: High
            |- no:  Low

Slide 9: Decision Tree Classification
A decision tree is created in two phases:
- Tree-building phase: repeatedly partition the training data until all the examples in each partition belong to one class, or the partition is sufficiently small.
- Tree-pruning phase: remove dependency on statistical noise or variation that may be particular only to the training set.

Slide 10: Tree Building Phase
General tree-growth algorithm (binary tree):

  Partition(Data S)
    if (all points in S are of the same class) then return;
    for each attribute A do
      evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);

Slide 11: Tree Building Phase (cont.)
The form of the split depends on the type of the attribute:
- Splits for numeric attributes are of the form A ≤ v, where v is a real number.
- Splits for categorical attributes are of the form A ∈ S', where S' is a subset of the possible values of A.

Slide 12: Splitting Index
- Alternative splits for an attribute are compared using a splitting index.
- Examples of splitting indices:
  - Entropy: entropy(T) = - Σj pj log2(pj)
  - Gini index: gini(T) = 1 - Σj pj²
  (pj is the relative frequency of class j in T)
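To make the tree-growth loop of slide 10 concrete, here is a minimal Python sketch that uses the gini index from slide 12 and picks the split with the largest gini reduction (the criterion formalized on slide 13 below). It is purely illustrative, not the SLIQ/SPRINT code: the names grow, best_split, and Node are invented here, and Car Type is coded as an integer instead of being split on subsets.

from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None    # set only on leaves
    attr: Optional[int] = None     # index of the attribute tested at this node
    value: Optional[float] = None  # records with record[attr] <= value go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels, attr):
    # Score every candidate test (attr <= v) by its reduction in gini index.
    parent, n = gini(labels), len(labels)
    best_gain, best_v = 0.0, None
    for v in sorted({r[attr] for r in records}):
        left = [y for r, y in zip(records, labels) if r[attr] <= v]
        right = [y for r, y in zip(records, labels) if r[attr] > v]
        if not left or not right:
            continue
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_gain, best_v = gain, v
    return best_gain, best_v

def grow(records, labels, min_size=1):
    # Stop when the partition is pure or sufficiently small.
    if len(set(labels)) == 1 or len(labels) <= min_size:
        return Node(label=Counter(labels).most_common(1)[0][0])
    # Choose the attribute whose best split gives the largest gini reduction.
    scored = [(best_split(records, labels, a), a) for a in range(len(records[0]))]
    (gain, v), attr = max(scored, key=lambda s: s[0][0])
    if v is None:                     # no split improves on the parent node
        return Node(label=Counter(labels).most_common(1)[0][0])
    left = [(r, y) for r, y in zip(records, labels) if r[attr] <= v]
    right = [(r, y) for r, y in zip(records, labels) if r[attr] > v]
    return Node(attr=attr, value=v,
                left=grow([r for r, _ in left], [y for _, y in left], min_size),
                right=grow([r for r, _ in right], [y for _, y in right], min_size))

# The training set from slide 5; Car Type is coded as an integer for brevity
# (0 = Family, 1 = Sports, 2 = Truck) instead of using subset splits.
records = [(23, 0), (17, 1), (43, 1), (68, 0), (32, 2), (20, 0)]
labels = ["High", "High", "High", "Low", "Low", "High"]
root = grow(records, labels)
print(root.attr, root.value)          # 0 23 -> the root test is "Age <= 23"

Because of the integer coding of Car Type, the resulting tree is only close to the one on slide 8; the point is the structure of the recursive partitioning loop, not the exact splits.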
Slide 13: The Best Split
- Suppose the splitting index is I(), and a split partitions S into S1 and S2.
- The best split is the one that maximizes:

    I(S) - [ |S1|/|S| · I(S1) + |S2|/|S| · I(S2) ]

Slide 14: Tree Pruning Phase
- Examine the initial tree built.
- Choose the subtree with the least estimated error rate.
- Two approaches for error estimation:
  - use the original training dataset (e.g. cross-validation)
  - use an independent dataset

Slide 15: SLIQ - Overview
- Capable of classifying disk-resident datasets
- Scalable for large datasets
- Uses a pre-sorting technique to reduce the cost of evaluating numeric attributes
- Uses a breadth-first tree-growing strategy
- Uses an inexpensive tree-pruning algorithm based on the Minimum Description Length (MDL) principle

Slide 16: Data Structure
- A list (the class list) for the class labels:
  - each entry has two fields: the class label and a reference to a leaf node of the decision tree
  - memory-resident
- A list for each attribute:
  - each entry has two fields: the attribute value and an index into the class list
  - written to disk if necessary

Slide 17: An Illustration of the Data Structure

  Age list           Car Type list         Class list
  Age   Index        Car Type   Index      #   Class   Leaf
  23    1            Family     1          1   High    N1
  17    2            Sports     2          2   High    N1
  43    3            Sports     3          3   High    N1
  68    4            Family     4          4   Low     N1
  32    5            Truck      5          5   Low     N1
  20    6            Family     6          6   High    N1

  (Index = index into the class list; Leaf = the leaf node the record currently belongs to.)

Slide 18: Pre-sorting
- Sorting of the data is required to find splits for numeric attributes.
- Previous algorithms sort the data at every node in the tree.
- Using the separate-list data structure, SLIQ sorts the data only once, at the beginning of the tree-building phase.

Slide 19: After Pre-sorting

  Age list           Car Type list         Class list
  Age   Index        Car Type   Index      #   Class   Leaf
  17    2            Family     1          1   High    N1
  20    6            Sports     2          2   High    N1
  23    1            Sports     3          3   High    N1
  32    5            Family     4          4   Low     N1
  43    3            Truck      5          5   Low     N1
  68    4            Family     6          6   High    N1

Slide 20: Node Split
- SLIQ uses a breadth-first tree-growing strategy.
- In one pass over the data, splits for all the leaves of the current tree can be evaluated.
- SLIQ uses the gini splitting index to evaluate splits.
- This requires the frequency distribution of class values in the data partitions.

Slide 21: Class Histogram
- A class histogram keeps the frequency distribution of class values for each attribute in each leaf node.
- For numeric attributes, the class histogram is a list of <class, frequency> pairs.
- For categorical attributes, the class histogram is a list of <attribute value, class, frequency> triples.

Slide 22: Evaluate Split

  for each attribute A
    traverse the attribute list of A
    for each value v in the attribute list
      find the corresponding class and leaf node l
      update the class histogram in leaf l
      if A is a numeric attribute then
        compute the splitting index for the test (A ≤ v) for leaf l
    if A is a categorical attribute then
      for each leaf of the tree do
        find the subset of A with the best split

  (A small illustrative sketch of this pass appears below.)

Slide 23: Subsetting for Categorical Attributes
If …
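As a concrete (and heavily simplified) illustration of slides 16-22, the following Python sketch builds the class list and a pre-sorted attribute list for Age, then makes one pass over the attribute list while maintaining per-leaf "below"/"above" class histograms to score candidate splits A ≤ v with the gini index. It is an assumption-laden toy, not the SLIQ implementation: the identifiers (class_list, age_list, evaluate_numeric) are invented here, everything is kept in memory, and only a single leaf (the root, N1) exists.

from collections import Counter

# Class list: one entry per record, holding its class label and the id of the
# leaf it currently falls into (all records start at the root leaf N1).
class_list = [
    {"class": "High", "leaf": "N1"},   # record 0  (Age 23, Family)
    {"class": "High", "leaf": "N1"},   # record 1  (Age 17, Sports)
    {"class": "High", "leaf": "N1"},   # record 2  (Age 43, Sports)
    {"class": "Low",  "leaf": "N1"},   # record 3  (Age 68, Family)
    {"class": "Low",  "leaf": "N1"},   # record 4  (Age 32, Truck)
    {"class": "High", "leaf": "N1"},   # record 5  (Age 20, Family)
]

# Attribute list for Age: (value, class-list index), pre-sorted once by value.
age_list = sorted([(23, 0), (17, 1), (43, 2), (68, 3), (32, 4), (20, 5)])

def gini(hist):
    # gini(T) = 1 - sum_j p_j^2, computed from a class-frequency histogram.
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def evaluate_numeric(attr_list, class_list):
    # One pass over a pre-sorted numeric attribute list: move each record from
    # the per-leaf "above" histogram to the "below" histogram and score the
    # candidate split (A <= v) at every position.
    above, below = {}, {}
    for entry in class_list:
        above.setdefault(entry["leaf"], Counter())[entry["class"]] += 1
        below.setdefault(entry["leaf"], Counter())
    best = {}                                   # leaf id -> (gini reduction, v)
    for value, idx in attr_list:
        leaf, cls = class_list[idx]["leaf"], class_list[idx]["class"]
        below[leaf][cls] += 1
        above[leaf][cls] -= 1
        n_lo, n_hi = sum(below[leaf].values()), sum(above[leaf].values())
        if n_hi == 0:                            # nothing above: not a split
            continue
        n = n_lo + n_hi
        gain = (gini(below[leaf] + above[leaf])
                - (n_lo / n) * gini(below[leaf])
                - (n_hi / n) * gini(above[leaf]))
        if gain > best.get(leaf, (0.0, None))[0]:
            best[leaf] = (gain, value)
    return best

print(evaluate_numeric(age_list, class_list))    # {'N1': (0.2222..., 23)}

Running it reports, for each leaf, the best gini reduction and the split value that achieves it; on the slide-5 data the root would be split on Age ≤ 23.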