Accumulated Local Effect (Feature Influence)

Kacper Sokol

ALE captures the influence of a specific feature value on the model’s prediction by quantifying the average (accumulated) difference between the predictions at the boundaries of a (small) *fixed interval* around the selected feature value (Apley and Zhu 2020). It is calculated by replacing the value of the explained feature with the interval boundaries for instances found in the designated data set whose value of this feature is within the specified range.

It communicates *global* (with respect to the *entire* explained model) feature influence.

ALE is an evolved version of (relaxed) Marginal Effect (ME) (Apley and Zhu 2020) that is less prone to being affected by feature correlation since it relies upon average prediction change. It also improves upon Partial Dependence (PD) (Friedman 2001) by ensuring that the influence estimates are based on *realistic instances* (thus respecting interactions between features / feature correlation), making the explanatory insights more truthful.

| Property | Accumulated Local Effect |
|---|---|
| relation | post-hoc |
| compatibility | model-agnostic |
| modelling | regression and probabilistic classification (numbers) |
| scope | global (per data set; generalises to cohort) |
| target | model (set of predictions) |
| data | tabular |
| features | numerical (ordinal categorical) |
| explanation | feature influence (visualisation) |
| caveats | feature binning |

\[ X_{\mathit{ALE}} \subseteq \mathcal{X} \]

\[ V_i = \{ x_i : x \in X_{\mathit{ALE}} \} \]

\[ \mathit{ALE}_i = \int_{v_{0}}^{x_i} \mathbb{E}_{X_{\setminus i} | X_{i} = v_i} \left[ f^i \left( X_{\setminus i} , X_{i} \right) | X_{i}=v_i \right] \; d v_i - \mathit{const} \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;= \int_{v_{0}}^{x_i} \left ( \int_{X_{\setminus i}} f^i \left( X_{\setminus i} , v_i \right) \; d \mathbb{P} ( X_{\setminus i} | X_i = v_i ) \right ) \; d v_i - \mathit{const} \]

\[ f^i (x_{\setminus i}, x_i) = \frac{\partial f (x_{\setminus i}, x_i)}{\partial x_i} \]

Based on the ICE notation (Goldstein et al. 2015):

\[ \hat{f}_S = \int_{z_{0, S}}^{x_S} \mathbb{E}_{X_{C} | X_S = z_S} \left[ \hat{f}^{S} \left( X_{S} , X_{C} \right) | X_S = z_S \right] \; d z_{S} - \mathit{const} \\ \;\;\;\;\;\;\;\;= \int_{z_{0, S}}^{x_S} \left ( \int_{X_C} \hat{f}^{S} \left( z_{S} , X_{C} \right) \; d \mathbb{P} ( X_{C} | X_S = z_S ) \right ) \; d z_{S} - \mathit{const} \]

\[ \hat{f}^{S} (x_S, x_C) = \frac{\partial \hat{f} (x_S, x_C)}{\partial x_S} \]

\[ \mathit{ALE}_i^{j} \approx \sum_{n=1}^{j} \frac{1}{|Z_n|} \sum_{x \in Z_n} \left[ f \left( x_{\setminus i} , x_i=Z_n^+ \right) - f \left( x_{\setminus i} , x_i=Z_n^- \right) \right] \]

\[ \overline{\mathit{ALE}_i^{j}} = \mathit{ALE}_i^{j} - \frac{1}{\sum_{Z_n \in Z} |Z_n|} \sum_{x \in Z} \mathit{ALE}_i(x) \]
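The bin-wise estimator above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `first_order_ale`, the `predict` callable and the quantile-based binning are choices made for this example.

```python
import numpy as np

def first_order_ale(predict, X, feature, n_bins=10):
    """Minimal first-order ALE estimate for one numerical feature.

    `predict` is any callable mapping a 2-D array to predictions,
    `X` is the designated data set and `feature` a column index.
    (Illustrative sketch; assumes a continuous, non-degenerate feature.)
    """
    x = X[:, feature]
    # Quantile-based bin edges Z_0 < Z_1 < ... < Z_j over the feature range.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    n_bins = len(edges) - 1
    # Assign every instance to the bin its feature value falls into.
    bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

    effects = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for n in range(n_bins):
        members = X[bin_idx == n]
        counts[n] = len(members)
        if counts[n] == 0:
            continue
        # Replace the feature with the bin's lower/upper boundary and
        # average the resulting prediction difference within the bin.
        lower, upper = members.copy(), members.copy()
        lower[:, feature] = edges[n]
        upper[:, feature] = edges[n + 1]
        effects[n] = np.mean(predict(upper) - predict(lower))

    ale = np.cumsum(effects)  # accumulate the differences across bins
    # Centre the curve so the mean effect over all instances is zero.
    ale -= np.sum(ale * counts) / counts.sum()
    return edges, ale
```

For a purely additive model the accumulated curve recovers the explained feature’s own contribution up to the centring constant.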

Given the need for binning, various approaches can be used:

- quantile,
- equal-width or
- custom binning.
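The trade-off between the first two options shows up clearly on a skewed feature; a custom approach simply supplies the bin edges directly. This snippet is purely illustrative and its variable names are not from any library.

```python
import numpy as np

rng = np.random.default_rng(42)
# A skewed feature: most of the mass near zero, a long right tail.
x = rng.exponential(scale=1.0, size=1000)

n_bins = 4
# Quantile bins: (roughly) equal instance counts, irregular widths.
quantile_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
# Equal-width bins: regular widths, possibly very uneven counts.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)

counts_q, _ = np.histogram(x, bins=quantile_edges)
counts_w, _ = np.histogram(x, bins=width_edges)
print(counts_q)  # balanced counts, but the last bin is very wide
print(counts_w)  # regular widths, but the tail bins are sparse
```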


ALE of a *single feature* captures *only* the effect of this particular feature on the explained model’s predictive behaviour, known as the *first-order effect*. ALE of *multiple features* captures the *exclusive* effect of the *interaction* between *n* features on the explained model’s predictive behaviour (adjusted for the overall effect as well as the main effect of each feature), known as the *n^th-order effect*, e.g., the *second-order effect*.

- **Easy and fast to generate**
- **Reasonably easy to interpret** (first-order ALE)
- Reliable when features are correlated (**unbiased**)
- Based on data that are **closely distributed to the real data**

- **Not so easy to implement**
- **Tricky to interpret for orders higher than first**
- Limited to explaining **two features at a time**
- ALE trends **should not be generalised to individual instances** across the feature range since the estimates are specific to each bin
- **Binning may skew the results** (aided by displaying the distribution of instances per bin); e.g., *quantiles* ensure good estimates given the number of instances per bin but may yield unusually long and short bins, whereas *fixed-width* binning offers regular bins but some may lack a sufficient number of points to offer reliable estimates

- The measurements may be sensitive to different binning approaches
- Computational complexity: \(\mathcal{O} \left( n \right)\), where \(n\) is the number of instances in the designated data set

ME captures the average response of a predictive model across a collection of instances (taken from a designated data set) for a specific value of a selected feature (found in the aforementioned data set) (Apley and Zhu 2020). When *relaxed* by including *similar feature values*, determined by a fixed interval around the selected value, this method offers similar insights to ALE: average prediction per interval instead of (accumulated) difference in prediction per interval.
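The contrast with ALE can be made concrete in code: a relaxed-ME sketch simply averages the model’s predictions within each interval, with no boundary substitution and no accumulation of differences. The function name and the quantile binning below are assumptions of this example, not part of any library.

```python
import numpy as np

def relaxed_marginal_effect(predict, X, feature, n_bins=10):
    """Average model response per feature interval (relaxed-ME sketch)."""
    x = X[:, feature]
    # Interval boundaries around the observed feature values.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    preds = predict(X)  # predictions for the unmodified instances
    # Mean prediction of the instances whose feature value is in each bin.
    me = np.array([preds[bin_idx == n].mean() for n in range(n_bins)])
    return edges, me
```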

Individual Conditional Expectation (ICE) communicates the influence of a specific feature value on the model’s prediction by fixing the value of this feature across a designated range for a selected data point (Goldstein et al. 2015). It is an instance-focused (local) “variant” of Partial Dependence.

Partial Dependence (PD) communicates the average influence of a specific feature value on the model’s prediction by fixing the value of this feature across a designated range for a set of instances. It is a model-focused (global) “variant” of Individual Conditional Expectation, calculated by averaging ICE across a collection of data points (Friedman 2001).
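This averaging relationship can be sketched directly: an ICE curve fixes the feature at each grid value for one instance, and PD is the instance-wise mean of those curves. The helper names below are illustrative, not any particular library’s API.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """One ICE curve per instance: the feature is fixed to each grid value."""
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_fixed = X.copy()
        X_fixed[:, feature] = value  # fix the feature across its range
        curves[:, j] = predict(X_fixed)
    return curves

def partial_dependence(predict, X, feature, grid):
    """PD is the average of the ICE curves over the data set."""
    return ice_curves(predict, X, feature, grid).mean(axis=0)
```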

| Python | R |
|---|---|
| ALEPython | ALEPlot |
| alibi | DALEX |
| | iml |

Apley, Daniel W, and Jingyu Zhu. 2020. “Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models.” *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 82 (4): 1059–86.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” *Annals of Statistics*, 1189–1232.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” *Journal of Computational and Graphical Statistics* 24 (1): 44–65.

Grömping, Ulrike. 2020. “Model-Agnostic Effects Plots for Interpreting Machine Learning Models.” *Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin Report* 1.