UVA STAT 2120 - Topic+02+Notes

STAT 2120: Notes on Topic 2

Least-squares regression:
• A regression line describes a one-way linear relationship between variables.
  o An explanatory variable, $x$, “explains” variability in a response variable, $y$.
  o Often one wants to predict $y$ from a given $x$. Such a prediction is denoted $\hat{y}$.
• The least-squares regression line makes the sum of squared prediction errors as small as possible.
  o A prediction error is the vertical distance between a given point and a regression line.
  o The formula for the least-squares regression line is $\hat{y} = b_0 + b_1 x$, with “slope” $b_1 = r \, s_y / s_x$ and “intercept” $b_0 = \bar{y} - b_1 \bar{x}$, where $r$ is the correlation and $s_x$, $s_y$ are the standard deviations of $x$ and $y$. Predictions are made by plugging in values of $x$.
  o Slope, $b_1$, is the amount of change in $\hat{y}$ when $x$ increases by one unit. Intercept, $b_0$, is the prediction at $x = 0$.
  o Calculate $b_0$ and $b_1$ by computer.
• Properties of the least-squares regression line:
  o Interchanging the roles of $x$ and $y$ changes the fitted line.
  o The line $\hat{y} = b_0 + b_1 x$ always passes through the point $(\bar{x}, \bar{y})$.
  o The slope formula $b_1 = r \, s_y / s_x$ interprets the relationship in units of $s_x$ and $s_y$ through $r$: a change of one standard deviation in $x$ corresponds to a change of $r$ standard deviations in $\hat{y}$.
  o Similarly, $r^2$ measures the proportion of variability in $y$ that is explained by $x$.
• The residuals describe the leftover variation in $y$ after fitting the least-squares regression line.
  o Each residual is defined by $y - \hat{y}$.
  o The average of the residuals is zero.
  o Analysis of residuals helps to assess the suitability of a linear relationship.
  o A residual plot is a scatterplot of residuals against the values of $x$.
  o The ideal residual plot should exhibit no systematic pattern; patterns indicating a departure from the linear relationship are curvature, trends in spread, and outliers in the residuals.
  o An outlier in $y$ corresponds with an outlier in the residuals. It appears as an observation that falls outside the overall pattern of the relationship.
• Influential observations are those whose individual deletion would have a strong impact on the regression line.
  o An influential observation is often an outlier in $x$, but may not be an outlier in $y$.

Cautions about correlation and regression:
• Basic cautions:
  o Correlation is for two-way relationships, regression for one-way relationships.
  o Both are relevant only for linear relationships.
  o Neither is resistant to outliers.
• Extrapolation is when predictions are made outside the range of the data.
  o Such predictions are often untrustworthy, since the linear relationship may not hold for $x$-values far outside those observed.
• Correlation calculated on “averaged” data tends to be higher than that calculated on individuals.
• The relationship between two variables may be influenced by a third, “lurking” variable that is not observed.
  o Lurking variables may influence relationships between any type of variables, quantitative or categorical.
• Association is not causation.
  o An observed association may reflect the influence of a causal lurking variable. Such an association is called a “nonsense correlation.”
  o An experiment that controls lurking variables is best for establishing causation.
  o It is possible to establish causation without performing an experiment that controls for lurking variables, but the evidence that arises is weaker.

Relationships in categorical data:
• Relationships in categorical data are explored by compiling variables in two-way tables.
  o A two-way table involves a row variable and a column variable.
  o A two-way table may record counts or percentages. Percentages are most useful because they are easy to compare in the form of distributions.
• Relationships are described through specialized distributions appearing in the table.
  o Bar graphs provide a useful means of presenting the relevant distributions.
  o The distributions of the row and column variables appear in the margins of the table, and are called marginal distributions. Given as counts, they are called row and column totals.
  o A conditional distribution is calculated from the counts of one variable limited to a given category of the other variable.
  o An association may be described by examining the conditional distributions of one variable across the categories of the other variable. Typically, the former would be the response variable and the latter the explanatory variable.
• Lurking variables may give rise to Simpson’s paradox: patterns seen in individual categories are reversed in the patterns of the combined data.
  o A lurking variable may arise when a three-way table is “aggregated” into a two-way table.

Introduction to producing data:
• Designing the production of data allows data analysis to answer specific questions.
  o Data are produced on a small scale, and the intent is to generalize to a wider scale. Answering questions this way (with “confidence”) is statistical inference.
  o Anecdotal evidence arises haphazardly and may not represent any relevant group of cases.
  o “Available data” arise for other purposes, but may help to answer present questions.
  o In a sample survey, a sample of cases is drawn from a population. (A census is a sample that consists of the entire population.)
• Confounding arises between explanatory variables when their relationships with the response are indistinguishable.
• An intervention is where one imposes a change in the conditions of data-production.
• In an observational study, individuals are observed, but no attempt is made to control the conditions of data-production.
  o Observational studies are often plagued by confounding between an observed variable and an unobserved lurking variable.
• In an experiment, the conditions of data-production are controlled by applying treatments to individuals.
  o One objective in designing an experiment is to avoid confounding between explanatory variables.
• Controlling data-production may involve questions of ethics.
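
The least-squares notes earlier in this topic say to calculate the slope and intercept by computer. A minimal sketch of that calculation follows, assuming the standard textbook formulas (slope equal to the correlation times the ratio of standard deviations, intercept chosen so the line passes through the point of means). The function name `least_squares` and the five data points are made up for illustration and do not come from the course.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (b0, b1, r) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation: sum of products of standardized values, divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept: line passes through (x_bar, y_bar)
    return b0, b1, r


# Hypothetical data, invented for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1, r = least_squares(xs, ys)
y_hat = [b0 + b1 * x for x in xs]                  # predictions
residuals = [y - yh for y, yh in zip(ys, y_hat)]   # residual = y - y_hat

# Proportion of variability in y explained by x, computed two ways.
y_bar = sum(ys) / len(ys)
sse = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = r ** 2
explained = 1 - sse / sst

print(f"y_hat = {b0:.3f} + {b1:.3f} x, r^2 = {r_squared:.3f}")
print(f"sum of residuals = {sum(residuals):.2e}")
```

The sum of the residuals comes out as zero up to rounding error, and `r_squared` agrees with `explained`, which matches the properties listed in the regression notes.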
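
Simpson’s paradox, described in the notes on categorical data, is easier to see with numbers. The sketch below uses hypothetical counts (invented for illustration, not real data) for two treatments, `X` and `Y`, within two severity groups. It computes the conditional success rates within each group, then aggregates the three-way table into a two-way table and recomputes the rates.

```python
# Hypothetical counts: (successes, total) for each treatment within each
# severity group. Severity plays the role of the lurking variable.
counts = {
    "mild":   {"X": (90, 100),  "Y": (340, 400)},
    "severe": {"X": (120, 400), "Y": (20, 100)},
}


def success_rate(successes, total):
    """Proportion of successes among the total cases."""
    return successes / total


# Conditional success rates within each category of the lurking variable.
within = {
    group: {t: success_rate(*cell) for t, cell in cells.items()}
    for group, cells in counts.items()
}

# Aggregate the three-way table into a two-way table by summing over severity.
combined = {}
for cells in counts.values():
    for t, (s, n) in cells.items():
        s0, n0 = combined.get(t, (0, 0))
        combined[t] = (s0 + s, n0 + n)

overall = {t: success_rate(*cell) for t, cell in combined.items()}

for group, rates in within.items():
    print(group, {t: round(p, 2) for t, p in rates.items()})
print("combined", {t: round(p, 2) for t, p in overall.items()})
```

With these made-up counts, treatment `X` has the higher success rate within each severity group, yet treatment `Y` has the higher rate in the combined table, because `X` was applied mostly to severe cases.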

Designing a sample:
• The key elements of a sampling study:
  o A population is a collection of individuals about which we want information and to which the conclusions of statistical inference are to be relevant.
  o A sampling frame lists the individuals in the population.
  o A sample is the subset of a population on which data are measured and put to analysis.
  o The response rate is the proportion of measured individuals in a preliminary sample.
  o The design of a sample refers to the method used

