Explaining Data

Kacper Sokol

Describing Data

Data as a Model

  • Representation of some underlying phenomenon – an implicit model
  • Inherent assumptions as well as measurement limitations and errors



  • Collection influenced by factors such as world view and mental model
  • Possibly partial and subjective
  • Embedded cultural biases, e.g., “How much is a lot?”

Data Characteristics


Summary statistics

  • feature distribution
  • per-class feature distribution
  • feature correlation
  • class distribution and ratio
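
A minimal sketch of how these summary statistics might be computed for the Iris dataset, assuming pandas and scikit-learn's bundled copy of the data (illustrative only, not necessarily the tooling behind the figures in this deck):

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target

    print(X.describe())                    # per-feature distribution summary
    print(y.value_counts(normalize=True))  # class distribution and ratio
    print(X.corr())                        # feature correlation matrix
    print(X.groupby(y).mean())             # per-class feature means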

Data Characteristics    

        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count          150.000000        150.000000         150.000000        150.000000
mean             5.843333          3.057333           3.758000          1.199333
std              0.828066          0.435866           1.765298          0.762238
min              4.300000          2.000000           1.000000          0.100000
25%              5.100000          2.800000           1.600000          0.300000
50%              5.800000          3.000000           4.350000          1.300000
75%              6.400000          3.300000           5.100000          1.800000
max              7.900000          4.400000           6.900000          2.500000

Data Characteristics    

Iris feature correlation

Data Characteristics    

Iris feature and target correlation
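
Both correlation views above can be sketched along these lines; the use of pandas and seaborn here is an assumption, and the target is simply the numeric class encoding (0, 1, 2):

    # Sketch: feature correlation and feature-target correlation for Iris
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    df = load_iris(as_frame=True).frame                # features plus numeric 'target' column

    feature_corr = df.drop(columns="target").corr()    # feature-feature correlation
    feature_target_corr = df.corr()                    # also includes the encoded target

    sns.heatmap(feature_target_corr, annot=True, cmap="coolwarm")
    plt.show()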

Data Characteristics    

Iris feature distribution

Data Characteristics    

Iris feature distribution per class

Data Characteristics    

Iris feature distribution per class
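
The overall and per-class feature distributions might be plotted as follows; seaborn histograms with kernel density estimates are an assumption, not necessarily what produced the original figures:

    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    df = iris.frame
    df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

    for feature in iris.feature_names:
        sns.histplot(data=df, x=feature, kde=True)                  # overall distribution
        plt.show()
        sns.histplot(data=df, x=feature, hue="species", kde=True)   # per-class distribution
        plt.show()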

Data Characteristics    


Transform characteristics and observations into explanations

  • “The classes are balanced”
  • “The data are bimodal”
  • “These features are highly correlated”
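
Such statements can also be generated programmatically from the computed statistics; a toy sketch in which the thresholds are arbitrary assumptions (detecting bimodality would require a dedicated statistical test and is omitted here):

    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target

    # "The classes are balanced": compare the smallest and largest class proportions
    ratios = y.value_counts(normalize=True)
    if ratios.min() / ratios.max() > 0.8:                 # arbitrary threshold
        print("The classes are balanced")

    # "These features are highly correlated": absolute Pearson correlation above 0.9
    corr = X.corr().abs()
    for a in corr.columns:
        for b in corr.columns:
            if a < b and corr.loc[a, b] > 0.9:            # arbitrary threshold
                print(f"'{a}' and '{b}' are highly correlated")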

Data Documentation

  • experimental setup (implicit assumptions)
  • collection methodology (by whom and for what purpose)
  • applied pre-processing (cleaning and aggregation)
  • privacy aspects
  • data owners
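
One lightweight way to keep such documentation with the data is a simple structured record; a hypothetical stub for Iris, with field values that are illustrative only:

    # Hypothetical documentation stub mirroring the fields listed above
    iris_datasheet = {
        "experimental setup": "150 iris flowers measured, 50 per species, four features each",
        "collection methodology": "Measurements by Edgar Anderson, popularised by Fisher (1936)",
        "applied pre-processing": "Tabulated as-is; no missing values, no aggregation",
        "privacy aspects": "No personal data involved",
        "data owners": "Widely redistributed public benchmark dataset",
    }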


Model-based Explainability

Instance-based Explainability

Iris clustering and centroids
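
The clustering-and-centroids view can be sketched with k-means; the choice of k-means and of three clusters (to mirror the three species) is an assumption:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X = load_iris(as_frame=True).data

    # Centroids act as prototype instances that summarise each cluster
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(kmeans.cluster_centers_)   # one centroid per cluster
    print(kmeans.labels_[:10])       # cluster assignment of the first few instances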

Dimensionality Reduction

  • Embeddings
  • Projections
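
A linear projection onto two dimensions, sketched here with PCA (PCA is just one illustrative choice of projection):

    from sklearn.decomposition import PCA
    from sklearn.datasets import load_iris

    X = load_iris().data
    projection = PCA(n_components=2).fit_transform(X)   # 150 x 2 linear projection
    print(projection[:5])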

Dimensionality Reduction    

t-SNE embeddings

Dimensionality Reduction    

t-SNE embeddings
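
A non-linear embedding with t-SNE (Van der Maaten and Hinton 2008); the result depends strongly on parameters such as perplexity, which is one reason parameterisation may be tricky. The specific values below are arbitrary:

    from sklearn.manifold import TSNE
    from sklearn.datasets import load_iris

    X = load_iris().data

    # Different perplexities can yield visibly different embeddings
    for perplexity in (5, 30):
        embedding = TSNE(n_components=2, perplexity=perplexity,
                         random_state=42).fit_transform(X)
        print(perplexity, embedding.shape)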

Wrap Up

Summary

  • Explainability is relevant to data collection and processing
  • We usually have to make some modelling assumptions
  • Parameterisation may be tricky

Bibliography

Arnold, Matthew, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, et al. 2019. “FactSheets: Increasing Trust in AI Services Through Supplier’s Declarations of Conformity.” IBM Journal of Research and Development 63 (4/5): 6:1–13. https://doi.org/10.1147/JRD.2019.2942288.
Bender, Emily M, and Batya Friedman. 2018. “Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science.” Transactions of the Association for Computational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018) at the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden.
Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. “The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards.” arXiv Preprint arXiv:1805.03677.
Kelley, Patrick Gage, Joanna Bresee, Lorrie Faith Cranor, and Robert W. Reeder. 2009. “A ‘Nutrition Label’ for Privacy.” In Proceedings of the 5th Symposium on Usable Privacy and Security, 4:1–12. SOUPS ’09. New York, NY, USA: ACM. https://doi.org/10.1145/1572532.1572538.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model Cards for Model Reporting.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–29. ACM.
Reisman, Dillon, Jason Schultz, Kate Crawford, and Meredith Whittaker. 2018. “Algorithmic Impact Assessments: A Practical Framework for Public Agency Accountability.” AI Now Institute.
Sokol, Kacper, and Peter Flach. 2020. “Explainability Fact Sheets: A Framework for Systematic Assessment of Explainable Approaches.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 56–67.
Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (11).
Yang, Ke, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, HV Jagadish, and Gerome Miklau. 2018. “A Nutritional Label for Rankings.” In Proceedings of the 2018 International Conference on Management of Data, 1773–76. ACM.

Questions