---
date: 2022-08-19
---

# Anomaly Detection

## Abstract

>[!Abstract] What makes an observation **unusual**? And how can we **detect** it?
>1. **[[Leverage]]**
>	- Leverage is a measure of distance that compares an individual observation against the average of all observations. This method detects unusual *features*.
>	$h_{ii} = x_i^\prime (X^\prime X)^{-1} x_i$
>2. **[[Residuals]]**
>	- Regression residuals measure the difference between observed and predicted outcome values. The residuals, in essence, capture what the model is not able to explain. A large residual is an indicator of an outlier.
>	$\hat{e}_i = y_i - \hat{y}_i = y_i - x_i^\prime \hat{\beta}$, where $x_i^\prime \hat{\beta}$ is the model's prediction
>3. **[[Influence]]**
>	- If a data point, or observation, is *influential*, then removing it will cause a significant change in the estimated model. Computationally, influence is a combination of both leverage and residuals.
>	$\hat{\beta} - \hat{\beta}_{(-i)} = \frac{(X^\prime X)^{-1} x_i \, \hat{e}_i}{1 - h_{ii}}$

One of the common tasks within data science and statistics is anomaly detection: a systematic approach to detecting unusual data points. But what makes an observation unusual? Although being unusual is not necessarily bad, it is important to understand and be able to explain outlying data points, as they might:

- Differ in the type of information they contain
- Carry more information than the typical data point
- Cause bias in the model
- Result from a different data-generating process
- Indicate measurement error
- Indicate fraud

>[!info] Should I just drop outliers?
>**Domain knowledge** is always king, and dropping observations only for statistical reasons is never wise.

![[Leverage]]

![[Residuals]]

![[Influence]]

>[!Conclusion]
>In this post, we have seen a couple of different ways in which observations can be "unusual": they can have either **unusual characteristics** or **unusual behavior**.
>In linear regression, when an observation has both, it is also influential: it tilts the model towards itself.

## References

- [Outliers, Leverage, Residuals, and Influential Observations](https://medium.com/towards-data-science/outliers-leverage-residuals-and-influential-observations-df3065a0388e)
- D. Cook, [Detection of Influential Observation in Linear Regression](https://www.jstor.org/stable/1268249) (1980), _Technometrics_.
- D. Cook, S. Weisberg, [Characterizations of an Empirical Influence Function for Detecting Influential Cases in Regression](https://www.jstor.org/stable/1268187) (1980), _Technometrics_.
- P. W. Koh, P. Liang, [Understanding Black-box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a) (2017), _ICML Proceedings_.
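As a quick sanity check of the three formulas above, they can be computed directly with NumPy. This is a minimal sketch on a hypothetical toy dataset with one planted anomaly (all data and variable names here are illustrative, not from the post); it verifies that the closed-form influence expression matches an explicit leave-one-out refit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: intercept plus one feature.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
X[-1, 1] = 6.0   # unusual feature value -> high leverage
y[-1] = -10.0    # unusual outcome value -> large residual

# Leverage: h_ii = x_i' (X'X)^{-1} x_i, the diagonal of the hat matrix.
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Residuals: e_i = y_i - x_i' beta_hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# Influence: closed-form change in beta_hat from deleting observation i.
i = n - 1
dfbeta = (XtX_inv @ X[i]) * e[i] / (1 - h[i])

# Check against an explicit leave-one-out refit without observation i.
X_loo, y_loo = X[:-1], y[:-1]
beta_loo = np.linalg.solve(X_loo.T @ X_loo, X_loo.T @ y_loo)
assert np.allclose(beta_hat - beta_loo, dfbeta)

print("leverage:", h[i], "residual:", e[i], "influence:", dfbeta)
```

Because the planted point has both high leverage and a large residual, its `dfbeta` is large: removing it visibly tilts the fitted line, exactly as the closing remark above describes.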