# Regression to the Mean
Remember that the **[[Linear Regression|regression line]]** estimates the average value of $y$ among data points whose first coordinate is near $x$.
It is the line $\hat{y} = a + bx$ that minimizes the sum of squared residuals:
$\sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - (a+bx_i))^2$
$b = r\frac{s_y}{s_x}$
$a = \bar{y} - b\bar{x}$
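These formulas can be checked numerically. A minimal sketch on made-up data (the data-generating model here is an arbitrary assumption for illustration), comparing $b = r\,s_y/s_x$ and $a = \bar{y} - b\bar{x}$ against numpy's least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x and a noisy linear response y
x = rng.normal(50, 10, size=200)
y = 0.8 * x + rng.normal(0, 6, size=200)

# Slope and intercept from the formulas above
r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)
b = r * sy / sx
a = y.mean() - b * x.mean()

# Cross-check against numpy's direct least-squares fit
b_ls, a_ls = np.polyfit(x, y, deg=1)
print(np.allclose([b, a], [b_ls, a_ls]))  # True
```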
The main use of regression is to predict $y$ from $x$:
- Given $x$, predict $y$ to be $\hat{y} = a + bx$
- The prediction for $y$ at $x=\bar{x}$ is simply $\hat{y} = \bar{y}$.
But $b=r\frac{s_y}{s_x}$ means that if $x$ is one standard deviation $s_x$ above $\bar{x}$, then the predicted $\hat{y}$ is only $rs_y$ above $\bar{y}$.
Since $r$ is between -1 and 1, the prediction is *towards the mean*: $\hat{y}$ is fewer standard deviations away from $\bar{y}$ than $x$ is from $\bar{x}$.
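The one-standard-deviation claim can be verified directly: since $b = r\,s_y/s_x$, plugging $x = \bar{x} + s_x$ into $\hat{y} = a + bx$ gives $\hat{y} = \bar{y} + r s_y$ exactly. A small check on arbitrary simulated data (the model is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data set
x = rng.normal(0, 1, size=500)
y = 0.5 * x + rng.normal(0, 1, size=500)

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)
b = r * sy / sx
a = y.mean() - b * x.mean()

# Predict at one standard deviation above the mean of x
y_hat = a + b * (x.mean() + sx)

# The prediction sits only r standard deviations above y-bar
print(np.isclose(y_hat - y.mean(), r * sy))  # True
```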
A heuristic explanation: scoring among the very top on the midterm requires excellent preparation as well as some luck. That luck may not hold on the final exam, so we expect this group to fall back a bit.
- This effect is simply a consequence of there being scatter around the line. Erroneously attributing it to some action (e.g. the top scorers on the midterm slacked off) is the **regression fallacy**.
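The midterm/final story can be simulated. A sketch assuming a simple (hypothetical) model where each score is a shared "skill" component plus independent luck: the top midterm scorers fall back toward the mean on the final even though nobody slacks off.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exam model: both scores share a skill component,
# plus independent luck on each exam
n = 100_000
skill = rng.normal(70, 8, size=n)
midterm = skill + rng.normal(0, 6, size=n)
final = skill + rng.normal(0, 6, size=n)

# Students in the top 5% on the midterm
top = midterm >= np.quantile(midterm, 0.95)

# Their midterm average is far above the overall mean, but their
# final average falls back toward it -- pure regression to the mean
mid_gap = midterm[top].mean() - midterm.mean()
fin_gap = final[top].mean() - final.mean()
print(mid_gap)  # large
print(fin_gap)  # noticeably smaller, but still positive
```

Both gaps are positive (skill is real), yet the final-exam gap shrinks: the luck that helped select the top group does not repeat.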