Influence - nati.sh

## Influence

One way of detecting outliers is by looking at the features of the data, i.e. using leverage. Another way is by looking at behavior, i.e. using model residuals. The question then arises: how do we know which data points have an important influence on the model? We can answer this question using the concept of **influence and influence functions.** The general idea is that if a data point, or observation, is *influential*, then removing it will cause a significant change in the estimated model.

In linear regression, the influence of observation $i$ is defined as:

$\hat{\beta} - \hat{\beta}_{-i} = \frac{(X^\prime X)^{-1} x_i \, e_i}{1 - h_{ii}}$

where $\hat{\beta}_{-i}$ is the OLS coefficient estimated omitting observation $i$, $e_i$ is the residual of observation $i$, and $h_{ii}$ is its leverage.

There is a close relationship between influence, leverage and residuals, and the formula makes it explicit: influence grows with the residual $e_i$ and with the leverage $h_{ii}$. In linear regression, data points with high influence are those that are both outliers in the features (high leverage) and have high residuals; both conditions must be met in order for a data point to qualify as an influencer.

```python
# Note: uses the absolute residual and omits the 1 / (1 - h_ii) factor of the formula above
df['influence'] = (np.linalg.inv(X.T @ X) @ X.T).T * np.abs(Y - Y_hat)
# Flag observations more than two standard deviations above the mean influence
df['high_influence'] = df['influence'] > (np.mean(df['influence']) + 2*np.std(df['influence']))
```

Plotting it:

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(data=df, x='influence', hue='high_influence', alpha=1, bins=30, ax=ax1).set(title='Distribution of Influences');
sns.scatterplot(data=df, x='hours', y='transactions', hue='high_influence', ax=ax2).set(title='Data Scatterplot');
```

![[influence_plot.png]]

The plot shows only one observation with high influence, and its value is disproportionately larger than the influence of all other data points. As you can see in the scatterplot, it is not always easy to tell by eye which data point has a high residual.

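
The closed-form influence expression above can be checked numerically against a brute-force refit that drops one observation at a time. The following is a minimal sketch on synthetic data, not the dataset used in this note: the sample size, coefficients, noise level and the helper names `influence_closed_form` and `influence_refit` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: an intercept plus one feature
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Full-sample OLS fit, residuals e_i and leverages h_ii
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # x_i' (X'X)^{-1} x_i

def influence_closed_form(i):
    """beta_hat - beta_hat_{-i} from the formula, without refitting."""
    return XtX_inv @ X[i] * residuals[i] / (1.0 - leverage[i])

def influence_refit(i):
    """beta_hat - beta_hat_{-i} by actually dropping observation i."""
    mask = np.arange(n) != i
    beta_minus_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    return beta_hat - beta_minus_i

# Largest disagreement between the two computations over all observations
max_gap = max(
    np.abs(influence_closed_form(i) - influence_refit(i)).max()
    for i in range(n)
)
print(f"max difference between formula and refit: {max_gap:.2e}")
```

The two computations should agree up to floating-point error, which is why the formula is convenient: it gives the effect of removing each observation without running $n$ separate regressions.
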
Here is a function to plot leverage, residuals and influence:

```python
def plot_leverage_residuals(df):
    # Hue: label each observation; later assignments override earlier ones
    df['type'] = 'Normal'
    df.loc[df['high_residual'], 'type'] = 'High Residual'
    df.loc[df['high_leverage'], 'type'] = 'High Leverage'
    df.loc[df['high_influence'], 'type'] = 'High Influence'

    # Init figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    # Left panel: fitted line over the raw data. The scatter on ax1 is an
    # assumed fix: without it ax1 has no legend and get_legend() returns None.
    ax1.plot(X, Y_hat, lw=1, c='grey', zorder=0.5)
    sns.scatterplot(data=df, x='hours', y='transactions', hue='type', ax=ax1).set(title='Data')
    # Right panel: leverage against residual
    sns.scatterplot(data=df, x='residual', y='leverage', hue='type', ax=ax2).set(title='Metrics')
    ax1.get_legend().remove()
    sns.move_legend(ax2, "upper left", bbox_to_anchor=(1.05, 0.8));
```

Resulting:

```python
plot_leverage_residuals(df)
```

![[residual_leverage_influence_plot.png]]

>[!Note] What qualifies an observation to be an influencer?
> Neither condition alone, high leverage or a high residual, is sufficient for an observation to be influential and *distort* the model; both must be present.
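
The point made in the note can be illustrated with a small constructed example. This is a sketch under stated assumptions, not the data used above: the base points lie exactly on a hypothetical line $y = 2x$, and the helper `added_point_diagnostics` and the three scenarios are invented for illustration. Each scenario adds a single extra point and reports that point's leverage, residual, and the norm of its influence on the coefficients.

```python
import numpy as np

# Hypothetical base data lying exactly on the line y = 2x
x_base = np.linspace(0.0, 1.0, 20)
y_base = 2.0 * x_base

def added_point_diagnostics(x_new, y_new):
    """Fit OLS on the base data plus one extra point and return that
    point's leverage, residual, and the norm of its influence."""
    x = np.append(x_base, x_new)
    y = np.append(y_base, y_new)
    X = np.column_stack([np.ones_like(x), x])   # intercept + feature
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    x_i = X[-1]
    h_ii = x_i @ XtX_inv @ x_i                  # leverage of the added point
    e_i = y[-1] - x_i @ beta_hat                # residual of the added point
    influence = XtX_inv @ x_i * e_i / (1.0 - h_ii)
    return h_ii, e_i, np.linalg.norm(influence)

scenarios = {
    'leverage only': (2.0, 4.0),    # far from the other x values, on the line
    'residual only': (0.5, 11.0),   # at the mean of x, 10 above the line
    'both':          (2.0, 24.0),   # far from the other x values, 20 above the line
}
results = {name: added_point_diagnostics(*point) for name, point in scenarios.items()}

for name, (h_ii, e_i, norm) in results.items():
    print(f"{name:14s} leverage={h_ii:.3f} residual={e_i:.3f} influence norm={norm:.3f}")
```

In this constructed setting, the point on the line has high leverage but essentially zero influence, the central point has the largest residual but only shifts the intercept slightly, and the point combining both moves the coefficients the most.
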