*Model Reliance*

(Feature Importance)

Kacper Sokol

PFI – sometimes called Model Reliance (Fisher, Rudin, and Dominici 2019) – quantifies the importance of a feature by measuring the change in predictive error incurred when permuting its values for a collection of instances (Breiman 2001).

It communicates global (with respect to the entire explained model) feature importance.

PFI was originally introduced for Random Forests (Breiman 2001) and later generalised to a model-agnostic technique under the name of Model Reliance (Fisher, Rudin, and Dominici 2019).

| Property | Permutation Feature Importance |
|---|---|
| relation | post-hoc |
| compatibility | model-agnostic |
| modelling | regression, crisp and probabilistic classification |
| scope | global (per data set; generalises to cohort) |
| target | model (set of predictions) |

| Property | Permutation Feature Importance |
|---|---|
| data | tabular |
| features | numerical and categorical |
| explanation | feature importance (numerical reporting, visualisation) |
| caveats | feature correlation, model’s goodness of fit, access to data labels, robustness (randomness of permutation) |

\[ I_{\textit{PFI}}^{j} = \frac{1}{N} \sum_{i = 1}^N \frac{\overbrace{\mathcal{L}(f(X^{(j)}_{i}), Y)}^{\text{permute feature } j}}{\mathcal{L}(f(X), Y)} \]

where \(X^{(j)}_{i}\) denotes the data set with the values of feature \(j\) permuted in the \(i\)-th of \(N\) repetitions, \(f\) is the explained model and \(\mathcal{L}\) is the chosen predictive performance metric.

- Difference \[ \mathcal{L}(f(X^{(j)}), Y) - \mathcal{L}(f(X), Y) \]
- Quotient \[ \frac{\mathcal{L}(f(X^{(j)}), Y)}{\mathcal{L}(f(X), Y)} \]

- Percent change \[ 100 \times \frac{\mathcal{L}(f(X^{(j)}), Y) - \mathcal{L}(f(X), Y)}{\mathcal{L}(f(X), Y)} \]
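A minimal sketch of how the difference and quotient variants could be computed by hand, assuming a fitted scikit-learn-style regressor with a `predict` method, NumPy arrays `X` and `y`, and mean squared error as the loss \(\mathcal{L}\) (all names are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def permutation_feature_importance(model, X, y, n_repeats=10, random_state=0):
    """Illustrative PFI: difference and quotient variants for every feature."""
    rng = np.random.default_rng(random_state)
    baseline = mean_squared_error(y, model.predict(X))  # L(f(X), Y)

    diff, quot = [], []
    for j in range(X.shape[1]):
        permuted_losses = []
        for _ in range(n_repeats):
            X_permuted = X.copy()
            # permute the values of feature j across all instances
            X_permuted[:, j] = rng.permutation(X_permuted[:, j])
            permuted_losses.append(mean_squared_error(y, model.predict(X_permuted)))
        permuted = np.mean(permuted_losses)   # average over the repetitions
        diff.append(permuted - baseline)      # difference variant
        quot.append(permuted / baseline)      # quotient variant
    return np.array(diff), np.array(quot)
```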

- PFI needs a **representative** sample of data to output a meaningful explanation
- The meaning of PFI is decided by the sample of data used for its generation

Some choices are:

- **Training Data** – instances used to train the explained model
- **Validation Data** – instances used to evaluate the predictive performance of the explained model; also employed for hyperparameter tuning
- **Test Data** – instances used to estimate the final, unbiased predictive performance of the explained model
- **Explainability Data** – a separate pool of instances reserved for explaining the behaviour of the model
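To illustrate how the data choice shapes the explanation, here is a hedged sketch using scikit-learn's `permutation_importance`, computed once on the training split and once on the test split (the data set and model are stand-ins chosen for the example):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# PFI on the training data may reward patterns memorised by the model
pfi_train = permutation_importance(
    model, X_train, y_train, n_repeats=10, random_state=42)
# PFI on previously unseen data reflects usefulness for new instances
pfi_test = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42)

print(pfi_train.importances_mean)
print(pfi_test.importances_mean)
```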

This explanation communicates how the model relies on data features during training, but not necessarily how the features influence predictions of unseen instances. The model may learn a relationship between a feature and the target variable that is due to a quirk of the training data – a random pattern present only in the training sample that, e.g., due to overfitting, boosts predictive performance on the training data alone.

The spurious correlations between data features and the target found uniquely in the training data or extracted due to overfitting are absent in the test data (previously unseen by the model). This allows PFI to communicate how useful each feature is for predicting the target, or whether some of the data features contributed to overfitting.

We can measure feature importance with alternative techniques such as Partial Dependence-based feature importance. This metric may not pick up the random feature's lack of predictive power since PD generates unrealistic instances that could follow the spurious pattern found in the training data.

Since the underlying predictive model (the one being explained) is a Decision Tree, we have access to its native estimate of feature importance. It conveys the overall decrease in the chosen impurity metric for all splits based on a given feature, by default calculated over the training data.
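With scikit-learn, for instance, this impurity-based estimate is exposed on the fitted tree via the `feature_importances_` attribute (the data set below is a placeholder for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# impurity-based importance (Gini decrease by default), derived from the training splits
print(tree.feature_importances_)
```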

- **Easy to generate and interpret**
- **All of the features can be explained at the same time**
- **Computationally efficient** in comparison to a brute-force approach such as *leave-one-out and retrain* (which also has a different interpretation)
- Accounts for the importance of the explained feature and all of its **interactions with other features** (which can also be considered a *disadvantage*)

- Requires **access to ground truth** (i.e., data and their labels)
- Influenced by **randomness of permuting feature values** (somewhat abated by repeating the calculation multiple times at the expense of extra compute)
- **Relies on the underlying model’s goodness of fit** since it is based on (the drop in) a predictive performance metric (in contrast to a more generic change in predictive behaviour – think predictive **robustness**)

- Assumes **feature independence**, which is often unreasonable
- May not reflect the true feature importance since it is based upon the predictive ability of the model for **unrealistic instances**
- In the presence of feature interactions, the **importance** – that one of the attributes would accumulate if alone – may be **distributed** across all of them in an arbitrary fashion (pushing them down the order of importance)
- Since it accounts for individual and interaction importance, the latter component is **accounted for multiple times**, making the sum of the scores inconsistent with (larger than) the drop in predictive performance (for the difference-based variant)

PFI is parameterised by:

- data set
- predictive performance metric
- number of repetitions
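These parameters map directly onto the arguments of scikit-learn's `permutation_importance`; a brief sketch, assuming the fitted `model`, `X_test` and `y_test` from the earlier example:

```python
from sklearn.inspection import permutation_importance

pfi = permutation_importance(
    model, X_test, y_test,               # the designated data set
    scoring="neg_mean_squared_error",    # predictive performance metric
    n_repeats=25,                        # number of permutation repetitions
    random_state=42,
)
print(pfi.importances_mean, pfi.importances_std)
```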

Generating PFI may be computationally expensive for *large sets of data* and a *high number of repetitions*.

Computational complexity: \(\mathcal{O} \left( n \times d \right)\), where

- \(n\) is the number of instances in the designated data set and
- \(d\) is the number of permutation repetitions

Many data-driven predictive models come equipped with some variant of feature importance. This includes Decision Trees and Linear Models among many others.

Partial Dependence captures the average response of a predictive model for a collection of instances when varying one of their features (Friedman 2001). By assessing the flatness of these curves we can derive a feature importance measurement (Greenwell, Boehmke, and McCarthy 2018).
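A possible sketch of this idea with scikit-learn's `partial_dependence`, using the standard deviation of each PD curve as a flatness-based importance score in the spirit of Greenwell, Boehmke, and McCarthy (2018); the fitted `model` and `X_train` are assumed to exist:

```python
import numpy as np
from sklearn.inspection import partial_dependence

pd_importance = []
for j in range(X_train.shape[1]):
    # one-dimensional PD curve for feature j
    pd_result = partial_dependence(model, X_train, features=[j], kind="average")
    pd_curve = pd_result["average"].ravel()
    # a flat curve indicates low importance; use its spread as the score
    pd_importance.append(np.std(pd_curve))

print(np.array(pd_importance))
```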

SHapley Additive exPlanations explains a prediction of a selected instance by using Shapley values to compute the contribution of each individual feature to this outcome (Lundberg and Lee 2017). It comes with various aggregation mechanisms that allow individual explanations to be transformed into global, model-based insights such as feature importance.
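One such aggregation – taking the mean absolute Shapley value per feature as a global importance score – can be sketched with the `shap` package (the fitted `model` and `X_test` are assumed; the exact algorithm chosen by `shap.Explainer` depends on the model type):

```python
import numpy as np
import shap

explainer = shap.Explainer(model, X_test)  # picks a model-appropriate algorithm
shap_values = explainer(X_test)            # local explanations, one per instance

# aggregate local attributions into a global feature importance score
global_importance = np.abs(shap_values.values).mean(axis=0)
print(global_importance)
```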

Local Interpretable Model-agnostic Explanations is a surrogate explainer that fits a linear model to data (expressed in an interpretable representation) sampled in the neighbourhood of an instance selected to be explained (Ribeiro, Singh, and Guestrin 2016). This local, inherently transparent model simplifies the black-box decision boundary in the selected sub-space, making it human-comprehensible. Given that these explanations are based on coefficients of the surrogate linear model, they can also be interpreted as (interpretable) feature importance.
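A brief, illustrative sketch with the `lime` package, reading off the surrogate coefficients for a single instance (the fitted regressor `model` and `X_train` are assumed, and the explainer settings are kept to defaults):

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train, mode="regression")

# fit a local linear surrogate around a single instance of interest
explanation = explainer.explain_instance(
    X_train[0], model.predict, num_features=5)

# (interpretable feature, surrogate coefficient) pairs
print(explanation.as_list())
```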

| Python | R |
|---|---|
| scikit-learn (`>=0.24.0`) | iml |
| alibi | vip |
| Skater | DALEX |
| rfpimp | |

Breiman, Leo. 2001. “Random Forests.” *Machine Learning* 45 (1): 5–32.

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2019. “All Models Are Wrong, but Many Are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously.” *J. Mach. Learn. Res.* 20 (177): 1–81.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” *Annals of Statistics*, 1189–1232.

Greenwell, Brandon M, Bradley C Boehmke, and Andrew J McCarthy. 2018. “A Simple and Effective Model-Based Variable Importance Measure.” *arXiv Preprint arXiv:1805.04755*.

Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” *Advances in Neural Information Processing Systems* 30.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier.” In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 1135–44.

Wei, Pengfei, Zhenzhou Lu, and Jingwen Song. 2015. “Variable Importance Analysis: A Comprehensive Review.” *Reliability Engineering & System Safety* 142: 399–432.