EXST7034 - Regression Techniques

Regression diagnostics: dependent variable $Y_i$

A number of graphical representations help with problem detection and can be used to obtain a better understanding of the available dataset.

a) Box plot (available in SAS PROC UNIVARIATE)
   This plot shows the quartiles: the maximum (100th percentile) and minimum (0th percentile) values, the 1st quartile (25th percentile), the 2nd quartile (50th percentile, or the median) and the 3rd quartile (75th percentile).

   [Figure: box plot on a 0 to 100 scale, labeled Maximum (4th quartile), 3rd quartile, Median (2nd quartile), 1st quartile, and Minimum]

   These statistics are "non-parametric" in that they do not depend on the form of the distribution, but they will help get a feel for the distribution.
   1) If the median is centered, and the box is centered between the maximum and minimum, then the data are symmetric.
   2) If the median is NOT centered, this indicates a skew in the data.
   3) If the median is centered in the box, and the box IS NOT centered between the maximum and minimum, this may indicate an outlier.
   4) If neither the median nor the box is centered, this is a pretty good indicator of skewness.

   Example from Freund and Wilson: Tree Weight on Length, done as linear.

   Boxplot

        0        beyond 1.5 interquartile distances
        |        1.5 interquartiles above third quartile
        |
        |
     +-----+     third quartile
     |  +  |     mean
     *-----*     median or second quartile
     +-----+     first quartile
        |
        |
        |
        |        minimum

   The interquartile distance is the distance from the 25th to the 75th percentile.
   The whiskers (vertical bars) extend out 1.5 interquartile distances from the quartiles.
   Outside the whiskers, values are represented with 0 out to 3 interquartile distances.
   Beyond 3 interquartile distances, values are represented with asterisks.

b) Time plot
   This plot could be obtained in SAS with PROC PLOT, but some order variable (e.g. OBS) is required.
   Where applicable, this can be a useful indicator of variation which may not
   be accounted for by the regression line. The order in which the data were acquired may also be useful as an indicator.
   1) Very often there is some effect of "time" in the model. The time plot will aid in detecting such an effect.
   2) The data may not have been gathered at random, and some aspects of this can also be detected.
   3) If the data must be gathered over time, try to randomize the independent variable $X_i$, when this can be controlled, over time to avoid confounding.
      For example, we wish to measure the amount of lactose in a fish's blood, and regress this on the amount of time the fish spent swimming in a 1 foot/sec current created in an artificial stream. The independent variable is "time", but not in the sense of a time plot. The order here may be important, as lactose may change if there are slight shifts in the current speed over time, or a buildup of metabolic wastes in the water which could affect lactose levels. Don't do all the short times first and all the long times later.

c) Stem and leaf plot (available in SAS PROC UNIVARIATE)
   This plot is useful to give an extra dimension to the information obtained from the box plot.
   1) As with the box plot, we get an additional idea of symmetry.
   2) This plot will also indicate bimodality or polymodality, which the box plot will not.

d) Dot plot (similar to a histogram; in SAS, PROC CHART)
   This plot is similar to a stem and leaf plot, plotted horizontally instead of vertically.
   1) As with the box plot, we get an additional idea of symmetry.
   2) This plot will also indicate bimodality or polymodality, which the box plot will not.

Direct observation of $Y_i$ is frequently not a useful undertaking. $Y_i$ is assumed to be normally distributed at EACH value of $X_i$. Even perfectly normally distributed data could appear polymodal or asymmetric if the data are taken at widely separated $X_i$ values, or if most of the data are taken at high or low values of $X_i$.

Observations of $e_i$ are generally more useful.
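A minimal sketch of this point in Python, using made-up data. The two X values, the line Y = 2 + 3X, and the noise scale are assumptions chosen for illustration, not values from the course examples. Evenly spaced normal quantiles stand in for random noise so the result is reproducible. The pooled Y values fall into two widely separated clusters, while the residuals from the fitted line form a single cluster.

```python
from statistics import NormalDist, mean

def normal_scores(k, sd):
    """k evenly spaced normal quantiles: a deterministic stand-in for random noise."""
    nd = NormalDist(0.0, sd)
    return [nd.inv_cdf((j + 0.5) / k) for j in range(k)]

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def largest_gap(values):
    """Largest distance between neighbouring values after sorting."""
    s = sorted(values)
    return max(b - a for a, b in zip(s, s[1:]))

# Hypothetical design: 20 observations at X = 1 and 20 at X = 10.
noise = normal_scores(20, sd=1.0)
x = [1.0] * 20 + [10.0] * 20
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, noise + noise)]

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Y has a wide empty stretch between its two clusters; the residuals do not.
print("largest gap in Y:        ", round(largest_gap(y), 2))
print("largest gap in residuals:", round(largest_gap(resid), 2))
```

A stem and leaf plot or dot plot of `y` would look bimodal here even though the noise is normal at each X value; the same plot of `resid` would not.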
$Y_i$ alone can be misleading.

Residual Analysis and RESIDUAL PLOTS
   These help in determining if the ASSUMPTIONS are met, and if the model is correct.

   Given the population residuals (error terms),
      $\varepsilon_i = Y_i - E(Y_i)$
   which are assumed to be NIDrv$(0, \sigma^2)$.

   These are then estimated by the observed deviations,
      $e_i = Y_i - \hat{Y}_i$
   where we know that
      $\bar{e} = \frac{\sum e_i}{n} = 0$, since $\sum e_i = 0$
   and we define
      $MSE = \frac{\sum (e_i - \bar{e})^2}{n-2} = \frac{\sum e_i^2}{n-2} = \frac{SSE}{n-2}$
   where $E(MSE) = \sigma^2$.

   Note that the residuals measure the actual deviation of the point from the regression line, and as a result will have the same units as the variable $Y_i$.
   For example, if we regress vial breakage on transfers, then a residual of 2 would be 2 vials.

   For some examinations, residuals are standardized to a "Z" distribution. Since the residuals are assumed to be normal, we know that (for large samples) about 68% of the standardized residuals would fall between -1 and 1, about 95% would fall between -1.96 and 1.96, and about 99% would fall between -2.576 and 2.576.
   If we standardize the residuals, we can examine a residual and see if it appears to be "unusually large" or within the usual bounds.

   Standardized residuals are calculated as
      standardized $e_i = \frac{e_i - \bar{e}}{\hat{\sigma}} = \frac{e_i}{\sqrt{MSE}}$

Residual diagnostics: the plots previously mentioned can also be used for residuals, with similar interpretations.

a) Box plot: shows the quartiles.

b) Time plot, or the order the observations were taken in.
   Weld example: later welds were stronger due to some learning process. This could not be determined if the welds were not done in random order; "learning" would be fitted in with the functional relationship.

c) Stem and leaf plot: can show modes.

d) Dot plot: similar information to the stem and leaf plot.

e) Normal probability plot (available in SAS in PROC UNIVARIATE)
   This is a plot of the values of the residuals, ordered from smallest to largest. Untransformed, this should be sigmoid for a normal distribution.
   Usually the data are transformed so that a normal distribution, when plotted, would be a straight line. The transformation for each residual is
      $\sqrt{MSE} \; z\!\left( \frac{i - 0.375}{n + 0.25} \right)$
   where $n$ is the number of observations, $i$ is the observation number from smallest to largest, and $z(p)$ is the standard normal value with cumulative probability $p$.
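The residual calculations in these notes can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data values and the function names (`fit_line`, `residual_diagnostics`) are made up for the example and do not come from the course, and only the standard library is used. It computes the residuals $e_i$, $MSE = SSE/(n-2)$, the standardized residuals $e_i/\sqrt{MSE}$, and the expected value $\sqrt{MSE}\,z[(i-0.375)/(n+0.25)]$ for each ordered residual.

```python
from math import sqrt
from statistics import NormalDist, mean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residual_diagnostics(x, y):
    """Residuals, MSE, standardized residuals, and normal-plot expected values."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]    # e_i = Y_i - Yhat_i
    sse = sum(ei ** 2 for ei in e)                       # e-bar is 0, so SSE = sum of e_i^2
    mse = sse / (n - 2)                                  # MSE = SSE / (n - 2)
    standardized = [ei / sqrt(mse) for ei in e]          # e_i / sqrt(MSE)
    z = NormalDist()                                     # standard normal
    expected = [sqrt(mse) * z.inv_cdf((i - 0.375) / (n + 0.25))
                for i in range(1, n + 1)]                # i = rank, smallest to largest
    return {"residuals": e, "mse": mse, "standardized": standardized,
            "ordered": sorted(e), "expected": expected}

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

out = residual_diagnostics(x, y)
print("MSE:", round(out["mse"], 4))
for observed, expected in zip(out["ordered"], out["expected"]):
    print(round(observed, 3), round(expected, 3))
```

Plotting `ordered` against `expected` gives the transformed normal probability plot; points close to a straight line are consistent with normally distributed residuals.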

