UB CSE 574 - Census Data

Contents
• Census Data
• Predicting Gender from Census
• MPG Data Set (40 records from UCI repository)
• Example with MPG
• A Decision Stump
• Base Case 2: Don't split if no attribute is useful
• Test Set generated by the same method
• Over-fitting the Data
• Over-fitting the Data
• Effect of Noise in training data: cause over-fit
• Over-fitting with ID3
• Two Approaches To Prevent Over-fitting
• Criterion to Determine Correct Tree Size
• Training and Validation Set Approach
• Validation Set
• How to use validation set to Prune
• Effect of Pruning
• Reduced Error Pruning Properties
• Rule Post-Pruning
• Example for Rule Post-pruning
• Why Convert to rules before pruning?
• Rule Post-Pruning: Four steps
• Example for Rule Post-pruning
• Rule Post-Pruning: Four steps
• Example for Rule Post-pruning
• Pruned Rules
• Estimation of Rule Accuracy
• Chi Squared Approach to Avoid Overfitting
• What is a chi squared test?
• Handling Training Examples with Missing Attribute Values
• Handling Attributes with Differing Costs (3.7.5)
• Inductive Bias in ID3
• Inductive Bias in Decision Tree Learning
• Restriction Biases and Preference Biases
• Why Prefer Short Hypotheses?
• Summary
• Summary, continued

Census Data
• Figure slides: predicting Wealth, Age, and Gender from census data.

MPG Data Set (40 records from UCI repository)
• Predicting MPG, which has good/bad values.
• Look at all the information gains: Cylinders has the highest IG.

Example with MPG / A Decision Stump
• Figure slides: the decision stump (a one-level tree) built from the MPG data.

Base Case 2: Don't split if no attribute is useful
• Figure slides; training set error = 0.

Test Set generated by the same method
• Figure slides.

Over-fitting the Data
• ID3 grows each branch of the tree just deeply enough to perfectly classify the training samples.
• This over-fits the data when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function.

Over-fitting the Data
• Definition: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the training examples, but h′ has smaller error than h over the entire distribution of instances.

Effect of Noise in training data: cause over-fit
• Consider the effect of adding an incorrectly labeled sample: (Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No).
• ID3 will expand this node further and find a tree h that is more complex than h′ but underperforms it over the true distribution.

Over-fitting with ID3
• Task: learning which medical patients have a form of diabetes.
• Figure: training accuracy increases monotonically as the tree grows, but performance on unseen data decreases after the tree size exceeds 25 nodes.
• Noise in training data can cause over-fit.

Two Approaches To Prevent Over-fitting
• 1. Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
• 2. Allow the tree to over-fit the data and then post-prune it.
• Although the first approach is more direct, the second is found more successful in practice, because it is difficult to estimate when to stop.
• Both need a criterion to determine the final tree size.

Criterion to Determine Correct Tree Size
• 1. Training and validation set approach: use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.
• 2. Use all available data for training, but apply a statistical test (a chi-squared test, sketched below) to estimate whether expanding (or pruning) a particular node is likely to produce an improvement.
• 3. Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth when this encoding size is minimized.
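As a rough illustration of the second criterion, the sketch below tests whether a candidate split separates the classes more than chance would, so that a node is expanded only when the test passes. This is a hedged sketch, not the lecture's code: the function name, the list-based data layout, the SciPy dependency, and the 0.05 significance level are all assumptions made here for illustration.

```python
# Sketch (illustrative assumptions, not the lecture's code): chi-squared test that a
# candidate split produces class counts different from what chance alone would give.
from collections import Counter

from scipy.stats import chi2  # any chi-squared inverse CDF would do


def split_is_significant(labels, child_indices, alpha=0.05):
    """labels: class label of each example at the node.
    child_indices: one list of example indices per child of the candidate split.
    Returns True if the split's chi-squared statistic exceeds the critical value."""
    parent_counts = Counter(labels)
    n = len(labels)
    classes = list(parent_counts)

    statistic = 0.0
    for idx in child_indices:
        child_counts = Counter(labels[i] for i in idx)
        for c in classes:
            # expected count in this child if the split were irrelevant to the class
            expected = parent_counts[c] * len(idx) / n
            if expected > 0:
                statistic += (child_counts[c] - expected) ** 2 / expected

    dof = (len(child_indices) - 1) * (len(classes) - 1)  # contingency-table degrees of freedom
    if dof == 0:
        return False  # a single child or a single class can never justify a split
    return statistic > chi2.ppf(1 - alpha, dof)
```

Used inside ID3, a test like this implements the first approach above (stop growing early); the deck's later chi-squared slides describe the underlying test itself.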
Training and Validation Set Approach
• Training set: used to form the learned hypothesis.
• Validation set: used to evaluate the accuracy of this hypothesis over subsequent data, and also to evaluate the impact of pruning the hypothesis.
• Philosophy: the validation set is unlikely to exhibit the same random fluctuations as the training set, so it provides a check against over-fitting.

Validation Set
• Provides a safety check against overfitting spurious characteristics of the data.
• Needs to be large enough to provide a statistically significant sample of instances.
• Typically the validation set is one half the size of the training set.

How to use validation set to Prune
• Consider each of the decision nodes in the tree to be a candidate for pruning.
• Pruning a decision node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
• Reduced Error Pruning: nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

Effect of Pruning
• Figure: accuracy over the test set increases as nodes are pruned from the tree.

Reduced Error Pruning Properties
• When pruning begins, the tree is at its maximum size and lowest accuracy over the test set.
• As pruning proceeds, the number of nodes is reduced and accuracy over the test set increases.
• Disadvantage: when data is limited, the number of samples available for training is further reduced.
• Rule post-pruning is one approach (discussed next).
• Alternatively, partition the available data several times in multiple ways and then average the results.

Rule Post-Pruning
• Useful when data is limited.
• A practical method for finding high-accuracy hypotheses.
• A variant of rule post-pruning is used by C4.5.
• The C4.5 system is an outgrowth of the ID3 algorithm; C4.5 also handles numerical attributes, missing values, and noisy data.

Example for Rule Post-pruning
• Step 1: Learn the tree.
• Step 2: Convert the tree to equivalent rules, generating one rule for each leaf node.
• Leftmost path: IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No.

Why Convert to rules before pruning?
• Converting to rules allows distinguishing among the different contexts in which a decision node is used.
• Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.
• Converting to rules improves readability.

Rule Post-Pruning: Four steps
• 1. Infer the decision tree from the training set, growing the tree until the training data fits as well as possible and allowing overfitting to occur.
• 2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.

Example for Rule Post-pruning
• Step 1: Learn the tree. Step 2: Convert the tree to equivalent rules, one per leaf node; the leftmost path again gives IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No.
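In Mitchell's treatment, which this deck follows, the remaining two steps prune each rule by removing any precondition whose removal does not worsen its estimated accuracy, and then sort the pruned rules by estimated accuracy. The sketch below illustrates that precondition-pruning idea on the PlayTennis rule above. It is a minimal sketch, not the lecture's or C4.5's code: the dictionary-based example format, the function names, and the use of a held-out validation set for the accuracy estimate are assumptions (C4.5 itself estimates rule accuracy pessimistically from the training data).

```python
# Sketch (illustrative assumptions): rule post-pruning of a single rule.
# A rule is (preconditions, label), where preconditions is a list of (attribute, value)
# pairs; preconditions are dropped greedily while estimated accuracy does not decrease.

def rule_accuracy(rule, examples):
    """Fraction of examples covered by the rule that it labels correctly."""
    preconditions, label = rule
    covered = [ex for ex in examples
               if all(ex.get(attr) == value for attr, value in preconditions)]
    if not covered:
        return 0.0
    return sum(ex["class"] == label for ex in covered) / len(covered)


def prune_rule(rule, validation_examples):
    """Greedily remove any precondition whose removal does not lower estimated accuracy."""
    preconditions, label = rule
    preconditions = list(preconditions)
    improved = True
    while improved and preconditions:
        improved = False
        base = rule_accuracy((preconditions, label), validation_examples)
        for pre in list(preconditions):
            candidate = [p for p in preconditions if p != pre]
            if rule_accuracy((candidate, label), validation_examples) >= base:
                preconditions = candidate
                improved = True
                break
    return preconditions, label


# Hypothetical usage with the rule from the leftmost path above:
rule = ([("Outlook", "Sunny"), ("Humidity", "High")], "No")
validation = [
    {"Outlook": "Sunny", "Humidity": "High", "class": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "class": "Yes"},
]
print(prune_rule(rule, validation))
```

Because each rule is pruned independently, a precondition can be dropped from the rule for one path while being kept in the rule for another, which is exactly the context-sensitivity that converting to rules before pruning is meant to allow.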