10-3 Prediction Intervals and Variation

Figure 10-7 shows that the point (5, 13) lies on the regression line, but the point (5, 19) from the original data set does not lie on the regression line. If we completely ignore correlation and regression concepts and want to predict a value of y given a value of x and a collection of paired (x, y) data, our best guess would be the mean ȳ = 9. But in this case there is a linear correlation between x and y, so a better way to predict the value of y when x = 5 is to substitute x = 5 into the regression equation to get ŷ = 13. We can explain the discrepancy between ȳ = 9 and ŷ = 13 by noting that there is a linear relationship best described by the regression line. Consequently, when x = 5, the predicted value of y is 13, not the mean value of 9.

For x = 5, the predicted value of y is 13, but the observed sample value of y is actually 19. The discrepancy between ŷ = 13 and y = 19 cannot be explained by the regression line, and it is called a residual or unexplained deviation, which can be expressed in the general format of y - ŷ.

As in Section 3-2, where we defined the standard deviation, we again consider a deviation to be a difference between a value and the mean. (In this case, the mean is ȳ = 9.) Examine Figure 10-7 carefully and note these specific deviations from ȳ = 9:

Total deviation (from ȳ = 9) of the point (5, 19) = y - ȳ = 19 - 9 = 10
Explained deviation (from ȳ = 9) of the point (5, 19) = ŷ - ȳ = 13 - 9 = 4
Unexplained deviation (from ȳ = 9) of the point (5, 19) = y - ŷ = 19 - 13 = 6

These deviations from the mean are generalized and formally defined as follows.

DEFINITIONS

Assume that we have a collection of paired data containing the sample point (x, y), that ŷ is the predicted value of y (obtained by using the regression equation), and that the mean of the sample y values is ȳ.
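Before the formal wording, the arithmetic for the point (5, 19) can be checked with a minimal Python sketch. It takes the observed value 19, the predicted value 13, and the sample mean 9 from Figure 10-7 as given; the function name `deviations` and its parameter names are illustrative assumptions, not notation from the text.

```python
def deviations(y, y_hat, y_bar):
    """Return (total, explained, unexplained) deviations for one point.

    y     : observed sample value
    y_hat : value predicted by the regression equation
    y_bar : mean of the sample y values
    """
    total = y - y_bar          # observed value versus the mean
    explained = y_hat - y_bar  # predicted value versus the mean
    unexplained = y - y_hat    # observed versus predicted (the residual)
    return total, explained, unexplained


# Values read from Figure 10-7 for the point (5, 19).
total, explained, unexplained = deviations(y=19, y_hat=13, y_bar=9)
print(total, explained, unexplained)  # 10 4 6

# For any single point, the total deviation splits into the two parts.
assert total == explained + unexplained
```

The final assertion holds for every point, because the predicted value cancels when the explained and unexplained parts are added.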
The total deviation of (x, y) is the vertical distance y - ȳ, which is the distance between the point (x, y) and the horizontal line passing through the sample mean ȳ.

The explained deviation is the vertical distance ŷ - ȳ, which is the distance between the predicted y value and the horizontal line passing through the sample mean ȳ.

The unexplained deviation is the vertical distance y - ŷ, which is the vertical distance between the point (x, y) and the regression line. (The distance y - ŷ is also called a residual, as defined in Section 10-2.)

In Figure 10-7 we can see the following relationship for an individual point (x, y):

(total deviation) = (explained deviation) + (unexplained deviation)
(y - ȳ) = (ŷ - ȳ) + (y - ŷ)

The expression above involves deviations away from the mean, and it applies to any one particular point (x, y). If we sum the squares of deviations using all points (x, y), we get amounts of variation. The same relationship applies to the sums of squares shown in Formula 10-7, even though the expression above is not algebraically equivalent to Formula 10-7. In Formula 10-7, the total variation is the sum of the