Marginal Plots, M-Plots or Local Dependence Profiles (Feature Influence)

Kacper Sokol

Marginal Effect (ME) captures the average response of a predictive model across a collection of instances (taken from a designated data set) for a specific value of a selected feature (found in the aforementioned data set) (Apley and Zhu 2020). This measure can be relaxed by including similar feature values determined by a fixed interval around the selected value.

It communicates global (with respect to the entire explained model) feature influence.

ME improves upon Partial Dependence (PD) (Friedman 2001) by ensuring that the influence estimates are based on realistic instances (thus respecting feature correlation), making the explanatory insights more truthful.

Property | Marginal Effect
---|---
relation | post-hoc
compatibility | model-agnostic
modelling | regression, crisp and probabilistic classification
scope | global (per data set; generalises to cohort)
target | model (set of predictions)

Property | Marginal Effect
---|---
data | tabular
features | numerical and categorical
explanation | feature influence (visualisation)
caveats | feature correlation, heterogeneous model response

\[ X_{\mathit{ME}} \subseteq \mathcal{X} \]

\[ V_i = \{ x_i : x \in X_{\mathit{ME}} \} \]

\[ \mathit{ME}_i = \mathbb{E}_{X_{\setminus i} | X_{i}} \left[ f \left( X_{\setminus i} , X_{i} \right) | X_{i}=v_i \right] = \int_{X_{\setminus i}} f \left( X_{\setminus i} , x_i \right) \; d \mathbb{P} ( X_{\setminus i} | X_i = v_i ) \;\; \forall \; v_i \in V_i \]

\[ \mathit{ME}_i = \mathbb{E}_{X_{\setminus i} | X_{i}} \left[ f \left( X_{\setminus i} , X_{i} \right) | X_{i}=V_i \right] = \int_{X_{\setminus i}} f \left( X_{\setminus i} , x_i \right) \; d \mathbb{P} ( X_{\setminus i} | X_i = V_i ) \]

Based on the ICE notation (Goldstein et al. 2015):

\[ \left\{ \left( x_{S}^{(i)} , x_{C}^{(i)} \right) \right\}_{i=1}^N \]

\[ \hat{f}_S = \mathbb{E}_{X_{C} | X_S} \left[ \hat{f} \left( X_{S} , X_{C} \right) | X_S = x_S \right] = \int_{X_C} \hat{f} \left( x_{S} , X_{C} \right) \; d \mathbb{P} ( X_{C} | X_S = x_S ) \]

\[ \mathit{ME}_i \approx \frac{1}{\sum_{x \in X_{\mathit{ME}}} \mathbb{1} (x_i = v_i)} \sum_{x \in X_{\mathit{ME}}} \mathbb{1} (x_i = v_i) \; f \left( x \right) \;\; \forall \; v_i \in V_i \]
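The strict estimator above can be sketched in a few lines of NumPy; `predict` and `X` below are illustrative stand-ins for a fitted model's prediction function and the designated data set:

```python
import numpy as np

def marginal_effect(predict, X, i):
    """Strict Marginal Effect of feature i: for every unique value v_i in
    the data set, average the model's predictions over the instances whose
    i-th feature equals v_i (an empirical conditional expectation)."""
    values = np.unique(X[:, i])
    me = np.array([predict(X[X[:, i] == v]).mean() for v in values])
    return values, me

# toy example: a linear "model" over two correlated features
rng = np.random.default_rng(0)
x1 = rng.integers(0, 3, size=300).astype(float)  # discrete feature: 0, 1, 2
x2 = x1 + rng.normal(0.0, 0.1, size=300)         # strongly correlated feature
X = np.column_stack([x1, x2])
predict = lambda X: 2 * X[:, 0] + X[:, 1]        # stand-in predictive model

values, me = marginal_effect(predict, X, i=0)
```

Because the two features are correlated, the estimate at `v` is roughly `2v + v = 3v`, illustrating how ME averages over realistic, correlated instances rather than artificial ones.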

The relaxed variant measures ME for a range of values \(v_i \pm \delta\) around a selected value \(v_i\), instead of doing so precisely at that point.

\[ \mathit{ME}_i^{\pm\delta} = \mathbb{E}_{X_{\setminus i} | X_{i}} \left[ f \left( X_{\setminus i} , X_{i} \right) | X_{i}=v_i \pm \delta \right] = \int_{X_{\setminus i}} f \left( X_{\setminus i} , x_i \right) \; d \mathbb{P} ( X_{\setminus i} | X_i = v_i \pm \delta ) \;\; \forall \; v_i \in V_i \]

or

\[ \hat{f}_S^{\pm\delta} = \mathbb{E}_{X_{C} | X_S} \left[ \hat{f} \left( X_{S} , X_{C} \right) | X_S = x_S \pm \delta \right] = \int_{X_C} \hat{f} \left( x_{S} , X_{C} \right) \; d \mathbb{P} ( X_{C} | X_S = x_S \pm \delta ) \]
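The relaxed variant can be sketched by replacing the exact-match condition with interval membership; the per-interval instance counts are returned as well, since sparse intervals make the estimates unreliable. The grid, `predict` and `X` are illustrative assumptions:

```python
import numpy as np

def relaxed_marginal_effect(predict, X, i, grid, delta):
    """Relaxed ME: average the model's predictions over the instances whose
    i-th feature lies within v_i +/- delta, for each grid value v_i."""
    me = np.full(len(grid), np.nan)
    counts = np.zeros(len(grid), dtype=int)
    for k, v in enumerate(grid):
        mask = np.abs(X[:, i] - v) <= delta
        counts[k] = mask.sum()
        if counts[k] > 0:              # the estimate is undefined otherwise
            me[k] = predict(X[mask]).mean()
    return me, counts                  # counts signal estimate reliability

# toy example with a single continuous feature
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(1000, 1))
predict = lambda X: X[:, 0] ** 2       # stand-in predictive model
grid = np.linspace(0.1, 0.9, 5)
me, counts = relaxed_marginal_effect(predict, X, i=0, grid=grid, delta=0.05)
```

The returned `counts` can drive the rug plot or per-bin histogram mentioned in the caveats below.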

- **Easy to generate and interpret**
- Based on **real data**
- Assumes **feature independence**, which is often unreasonable and heavily biases the influence measurements
- May be **unreliable for certain values** of the explained feature when there is a low number of data points with that value (**strict**) or in a relevant bin (**relaxed**); this impacts the reliability of influence estimates (average prediction of the explained model for that value or range of values)
- Reliability of estimates can only be communicated by displaying a rug plot or distribution of instances per value or bin
- Diversity (heterogeneity) of the model's behaviour for each particular value or bin can only be communicated by **prediction variance**
- Limited to explaining **two features at a time**
- The measurements may be sensitive to different binning approaches for **relaxed** ME
- Computational complexity: \(\mathcal{O} \left( n \right)\), where \(n\) is the number of instances in the designated data set

Accumulated Local Effects (ALE) is an evolved version of (relaxed) ME that is less prone to being affected by feature correlation. It communicates the influence of a specific feature value on the model's prediction by quantifying the average (accumulated) difference between the predictions at the boundaries of a (small) fixed interval around the selected feature value (Apley and Zhu 2020). It is calculated by replacing the value of the explained feature with the interval boundaries for instances found in the designated data set whose value of this feature is within the specified range.

Individual Conditional Expectation (ICE) communicates the influence of a specific feature value on the model's prediction by fixing the value of this feature across a designated range for a selected data point (Goldstein et al. 2015). It is an instance-focused (local) "variant" of Partial Dependence.

Partial Dependence (PD) communicates the average influence of a specific feature value on the model's prediction by fixing the value of this feature across a designated range for a set of instances. It is a model-focused (global) "variant" of Individual Conditional Expectation, calculated by averaging ICE curves across a collection of data points (Friedman 2001).
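The ICE/PD relationship follows directly from these definitions and can be sketched as below; `predict`, `X` and `grid` are illustrative assumptions:

```python
import numpy as np

def ice_curves(predict, X, i, grid):
    """One ICE curve per instance: fix feature i to each grid value while
    keeping the remaining features of that instance unchanged."""
    curves = np.empty((len(X), len(grid)))
    for k, v in enumerate(grid):
        X_v = X.copy()
        X_v[:, i] = v                  # intervene on the explained feature
        curves[:, k] = predict(X_v)
    return curves

def partial_dependence(predict, X, i, grid):
    """PD is the ICE curves averaged over the data set (Friedman 2001)."""
    return ice_curves(predict, X, i, grid).mean(axis=0)

# toy example with two instances and a linear stand-in model
X = np.array([[0.0, 1.0], [0.0, 3.0]])
predict = lambda X: 2 * X[:, 0] + X[:, 1]
grid = np.array([0.0, 1.0])
ice = ice_curves(predict, X, i=0, grid=grid)
pd_curve = partial_dependence(predict, X, i=0, grid=grid)
```

Note that, unlike ME, these substitutions are applied to every instance regardless of its original feature value, which is exactly why PD can produce unrealistic instances when features are correlated.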

Python | R
---|---
N/A | DALEX

Apley, Daniel W, and Jingyu Zhu. 2020. “Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models.” *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 82 (4): 1059–86.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” *Annals of Statistics*, 1189–1232.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” *Journal of Computational and Graphical Statistics* 24 (1): 44–65.