BU CS 105 - Data Mining III - Numeric Estimation

Data Mining III: Numeric Estimation
Computer Science 105, Boston University, Fall 2014
David G. Sullivan, Ph.D.

Review: Numeric Estimation
• Numeric estimation is like classification learning.
  • it involves learning a model that takes an instance's input attributes and produces an estimate of the output attribute
  • the model is learned from a set of training examples that include the output attribute
• In numeric estimation, the output attribute is numeric.
  • we want to be able to estimate its value
[diagram: input attributes → model → output attribute/class]

Example Problem: CPU Performance
• We want to predict how well a CPU will perform on some task, given the following info about the CPU and the task:
  • CTIME: the processor's cycle time (in nanoseconds)
  • MMIN: minimum amount of main memory used (in KB)
  • MMAX: maximum amount of main memory used (in KB)
  • CACHE: cache size (in KB)
  • CHMIN: minimum number of CPU channels used
  • CHMAX: maximum number of CPU channels used
• We need a model that will estimate a performance score for a given combination of values for these attributes.
[diagram: CTIME, MMIN, MMAX, CACHE, CHMIN, CHMAX → model → performance (PERF)]

Example Problem: CPU Performance (cont.)
• The data was originally published in a 1987 article in the Communications of the ACM by Phillip Ein-Dor and Jacob Feldmesser of Tel-Aviv University.
• There are 209 training examples. Here are five of them (CTIME through CHMAX are the input attributes; PERF is the class/output attribute):

    CTIME   MMIN   MMAX  CACHE  CHMIN  CHMAX   PERF
      125    256   6000    256     16    128    198
       29   8000  32000     32      8     32    269
       29   8000  32000     32      8     32    172
      125   2000   8000      0      2     14     52
      480    512   8000     32      0      0     67

Linear Regression
• The classic approach to numeric estimation is linear regression.
• It produces a model that is a linear function (i.e., a weighted sum) of the input attributes.
  • example for the CPU data:
    PERF = 0.066 CTIME + 0.0143 MMIN + 0.0066 MMAX + 0.4945 CACHE – 0.1723 CHMIN + 1.2012 CHMAX – 66.48
  • this type of model is known as a regression equation
• The general format of a linear regression equation is:
    y = w1 x1 + w2 x2 + … + wn xn + c
  where:
  • y is the output attribute
  • x1, …, xn are the input attributes
  • w1, …, wn are numeric weights
  • c is an additional numeric constant
  • linear regression learns the weights and the constant

Linear Regression (cont.)
• Once the regression equation is learned, it can estimate the output attribute for previously unseen instances.
• example: to estimate CPU performance for the instance

    CTIME   MMIN   MMAX  CACHE  CHMIN  CHMAX   PERF
      480   1000   4000      0      0      0      ?

  we plug the attribute values into the regression equation:
    PERF = 0.066 * 480 + 0.0143 * 1000 + 0.0066 * 4000 + 0.4945 * 0 – 0.1723 * 0 + 1.2012 * 0 – 66.48
         = 5.9

Linear Regression with One Input Attribute
• Linear regression is easier to understand when there's only one input attribute, x1.
• In that case:
  • the training examples are ordered pairs of the form (x1, y), shown as points in a scatterplot
  • the regression equation has the form y = w1 x1 + c, shown as a line on the plot
  • w1 is the slope of the line; c is the y-intercept
• Linear regression finds the line that "best fits" the training examples.
[figure: scatterplot of (x1, y) training examples with the fitted line y = w1 x1 + c; c marks the y-intercept]

Linear Regression with One Input Attribute (cont.)
• On the plot, dotted vertical bars show the differences between:
  • the actual y values (the ones from the training examples)
  • the estimated y values (the ones given by the equation)
• Why do these differences exist?
• Linear regression finds the parameter values (w1 and c) that minimize the sum of the squares of these differences.

Linear Regression with Multiple Input Attributes
• When there are k input attributes, linear regression finds the equation of a hyperplane (the higher-dimensional analogue of a line) in (k+1) dimensions.
  • here again, it is the one that "best fits" the training examples
• The equation has the form we mentioned earlier:
    y = w1 x1 + w2 x2 + … + wn xn + c
• Here again, linear regression finds the parameter values (the weights w1, …, wn and constant c) that minimize the sum of the squares of the differences between the actual and predicted y values.

Linear Regression in Weka
• Use the Classify tab in the Weka Explorer.
• Click the Choose button to change the algorithm.
  • linear regression is in the folder labelled functions
• By default, Weka employs attribute selection, which means it may not include all of the attributes in the regression equation.

Linear Regression in Weka (cont.)
• On the CPU dataset with M5 attribute selection, Weka learns the following equation:
    PERF = 0.0661 CTIME + 0.0142 MMIN + 0.0066 MMAX + 0.4871 CACHE + 1.1868 CHMAX – 66.60
  • it does not include the CHMIN attribute
• To eliminate attribute selection, you can click on the name of the algorithm and change the attributeSelectionMethod parameter to "No attribute selection".
  • doing so produces our earlier equation:
    PERF = 0.066 CTIME + 0.0143 MMIN + 0.0066 MMAX + 0.4945 CACHE – 0.1723 CHMIN + 1.2012 CHMAX – 66.48
• Notes about the coefficients:
  • what do the signs of the attributes mean?
  • what about their magnitudes?

Evaluating a Regression Equation
• To evaluate the goodness of a regression equation, we again set aside some of the examples for testing.
  • do not use these examples when learning the equation
  • use the equation on the test examples and see how well it does
• Weka provides a variety of error measures, which are based on the differences between the actual and estimated y values.
  • we want to minimize them
• The correlation coefficient measures the degree of correlation between the actual and estimated y values.
  • it is between -1.0 and 1.0
  • we want to maximize it

Simple Linear Regression
• This algorithm in Weka creates a regression equation that uses only one of the input attributes.
  • even when there are multiple inputs
• Like 1R, simple linear regression can serve as a baseline.
  • compare the models from more complex algorithms to the model it produces
• It also gives insight into which of the inputs has the largest impact on the output.

Handling Non-Numeric Input Attributes
• We employ numeric estimation when the output attribute is numeric.
• Some algorithms for numeric estimation also require that the input attributes be numeric.
• If we have a non-numeric input attribute, it may be possible to convert it to a numeric one.
  • example: if we have a binary attribute (yes/no or true/false), we can convert the two values to 0 and 1
• In Weka, many algorithms – including linear regression – will automatically adapt to non-numeric input attributes.
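The two key ideas above — a regression equation as a weighted sum, and least-squares fitting in the one-attribute case — can be sketched in plain Python. The CPU weights below are the ones from the slides; the one-attribute (x1, y) data is a made-up toy example, not part of the CPU dataset.

```python
# Sketch of (1) applying a learned regression equation and
# (2) least-squares fitting with one input attribute.

def apply_regression(weights, constant, instance):
    """Estimate the output as a weighted sum of the inputs plus a constant."""
    return sum(weights[a] * instance[a] for a in weights) + constant

# regression equation learned from the CPU data (coefficients from the slides)
cpu_weights = {"CTIME": 0.066, "MMIN": 0.0143, "MMAX": 0.0066,
               "CACHE": 0.4945, "CHMIN": -0.1723, "CHMAX": 1.2012}
cpu_constant = -66.48

# the previously unseen instance from the worked example
instance = {"CTIME": 480, "MMIN": 1000, "MMAX": 4000,
            "CACHE": 0, "CHMIN": 0, "CHMAX": 0}
perf = apply_regression(cpu_weights, cpu_constant, instance)  # about 5.9

def simple_linear_regression(xs, ys):
    """Fit y = w1*x1 + c by least squares: choose w1 and c to minimize the
    sum of squared differences between actual and estimated y values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # closed-form solution: slope from covariance over variance of x
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    c = mean_y - w1 * mean_x
    return w1, c

# toy training examples (x1, y)
w1, c = simple_linear_regression([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.0, 8.1])
estimate = w1 * 5.0 + c  # estimate y for a previously unseen x1 = 5.0
```

The same closed-form slope/intercept computation is what any simple-linear-regression implementation performs; with multiple input attributes the minimization is done over all the weights at once.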

