Evaluating Explainability

Kacper Sokol

Current Practice

Lack of Consensus

  • What to evaluate

  • How to evaluate it

  • Approaches

    • Technical – numerical evaluation
    • Social – user studies

Evaluation Tiers

Evaluation tier                    Humans           Task
Application-grounded Evaluation    Real Humans      Real Tasks
Human-grounded Evaluation          Real Humans      Simple Tasks
Functionally-grounded Evaluation   No Real Humans   Proxy Tasks

Source: Doshi-Velez and Kim (2017).

Numerical Evaluation

  • Global fidelity
  • Local fidelity
  • Surrogate fidelity
  • Black-box fidelity
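
As an illustration of the numerical approach, the sketch below contrasts global and local fidelity for a toy surrogate explainer. It assumes that fidelity is measured as the fraction of sampled points on which the surrogate and the black box predict the same class; the slides do not fix a definition, so this is one common reading rather than the author's method. The `black_box` and `surrogate` functions, the sampling ranges and the neighbourhood width are all hypothetical choices made for this example.

```python
import random


def black_box(x):
    """Hypothetical opaque model: class 1 inside the unit circle, else 0."""
    return 1 if x[0] ** 2 + x[1] ** 2 <= 1.0 else 0


def surrogate(x):
    """Hypothetical linear surrogate, tangent to the circle at (1, 0)."""
    return 1 if x[0] <= 1.0 else 0


def fidelity(model, explainer, points):
    """Fraction of points on which the surrogate agrees with the black box."""
    agree = sum(model(p) == explainer(p) for p in points)
    return agree / len(points)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Global sample (assumption): uniform over the square [-2, 2] x [-2, 2].
global_points = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(5000)]

# Local sample (assumption): Gaussian neighbourhood of the explained instance.
instance = (1.0, 0.0)
local_points = [
    (rng.gauss(instance[0], 0.1), rng.gauss(instance[1], 0.1))
    for _ in range(5000)
]

global_fidelity = fidelity(black_box, surrogate, global_points)
local_fidelity = fidelity(black_box, surrogate, local_points)

print(f"global fidelity: {global_fidelity:.3f}")
print(f"local fidelity:  {local_fidelity:.3f}")
```

In this toy setup the surrogate agrees with the black box far more often near the explained instance than across the whole square, which shows why a single fidelity number is hard to interpret without stating where it was measured.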

Desiderata-based Evaluation

  • Interactiveness (U4)
  • Actionability (U5)
  • Novelty (U8)

    See the taxonomy module for a review of explainability desiderata.

Human-based Evaluation

  • Evaluating simulatability is insufficient
  • Evaluating task completion is insufficient too
  • We need to assess understanding

Way Forward

Beyond Human-centred Evaluation

  • Shift towards human-centred explainability may have overcompensated
  • These are socio-technical systems – both aspects should be accounted for

Automated Decision-making

Automated decision-making workflow

  • Naïve view
  • Current validation
  • Explanatory insight & presentation medium
  • Proposed validation 1
  • Phenomenon & explanation
  • Proposed validation 2

Wrap Up


  • Evaluation is task-specific and context-dependent

  • It should account for both aspects of XML systems

    • Technical – the algorithms generating insights
    • Social – the explanatory artefacts and communication media
  • Overall, it should assess human understanding


Doshi-Velez, Finale, and Been Kim. 2017. “Towards a Rigorous Science of Interpretable Machine Learning.” arXiv Preprint arXiv:1702.08608.