Explaining Data

Kacper Sokol

Describing Data

Data as a Model

  • Representation of some underlying phenomenon – an implicit model
  • Inherent assumptions as well as measurement limitations and errors



  • Collection influenced by factors such as world view and mental model
  • Possibly partial and subjective
  • Embedded cultural biases, e.g., “How much is a lot?”

Data Characteristics


Summary statistics

  • feature distribution
  • per-class feature distribution
  • feature correlation
  • class distribution and ratio
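
A minimal sketch of how these summary statistics might be computed for the Iris dataset, assuming pandas and scikit-learn's bundled copy of the data (illustrative only, not necessarily the tooling behind the figures in this deck):

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target

    print(X.describe())                    # per-feature distribution summary
    print(y.value_counts(normalize=True))  # class distribution and ratio
    print(X.corr())                        # feature correlation matrix
    print(X.groupby(y).mean())             # per-class feature means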

Data Characteristics    

        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count          150.000000        150.000000         150.000000        150.000000
mean             5.843333          3.057333           3.758000          1.199333
std              0.828066          0.435866           1.765298          0.762238
min              4.300000          2.000000           1.000000          0.100000
25%              5.100000          2.800000           1.600000          0.300000
50%              5.800000          3.000000           4.350000          1.300000
75%              6.400000          3.300000           5.100000          1.800000
max              7.900000          4.400000           6.900000          2.500000

Data Characteristics    

Iris feature correlation

Data Characteristics    

Iris feature and target correlation
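
Both correlation views above can be sketched along these lines; the use of pandas and seaborn here is an assumption, and the target is simply the numeric class encoding (0, 1, 2):

    # Sketch: feature correlation and feature-target correlation for Iris
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    df = load_iris(as_frame=True).frame                # features plus numeric 'target' column

    feature_corr = df.drop(columns="target").corr()    # feature-feature correlation
    feature_target_corr = df.corr()                    # also includes the encoded target

    sns.heatmap(feature_target_corr, annot=True, cmap="coolwarm")
    plt.show()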

Data Characteristics    

Iris feature distribution

Data Characteristics    

Iris feature distribution per class

Data Characteristics    

Iris feature distribution per class
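
The overall and per-class feature distributions might be plotted as follows; seaborn histograms with kernel density estimates are an assumption, not necessarily what produced the original figures:

    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    df = iris.frame
    df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

    for feature in iris.feature_names:
        sns.histplot(data=df, x=feature, kde=True)                  # overall distribution
        plt.show()
        sns.histplot(data=df, x=feature, hue="species", kde=True)   # per-class distribution
        plt.show()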

Data Characteristics    


Transform characteristics and observations into explanations

  • “The classes are balanced”
  • “The data are bimodal”
  • “These features are highly correlated”
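
Such statements can also be generated programmatically from the computed statistics; a toy sketch in which the thresholds are arbitrary assumptions (detecting bimodality would require a dedicated statistical test and is omitted here):

    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target

    # "The classes are balanced": compare the smallest and largest class proportions
    ratios = y.value_counts(normalize=True)
    if ratios.min() / ratios.max() > 0.8:                 # arbitrary threshold
        print("The classes are balanced")

    # "These features are highly correlated": absolute Pearson correlation above 0.9
    corr = X.corr().abs()
    for a in corr.columns:
        for b in corr.columns:
            if a < b and corr.loc[a, b] > 0.9:            # arbitrary threshold
                print(f"'{a}' and '{b}' are highly correlated")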

Data Documentation

  • experimental setup (implicit assumptions)
  • collection methodology (by whom and for what purpose)
  • applied pre-processing (cleaning and aggregation)
  • privacy aspects
  • data owners
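
One lightweight way to keep such documentation with the data is a simple structured record; a hypothetical stub for Iris, with field values that are illustrative only:

    # Hypothetical documentation stub mirroring the fields listed above
    iris_datasheet = {
        "experimental setup": "150 iris flowers measured, 50 per species, four features each",
        "collection methodology": "Measurements by Edgar Anderson, popularised by Fisher (1936)",
        "applied pre-processing": "Tabulated as-is; no missing values, no aggregation",
        "privacy aspects": "No personal data involved",
        "data owners": "Widely redistributed public benchmark dataset",
    }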


Model-based Explainability

Instance-based Explainability

Iris clustering and centroids
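
The clustering-and-centroids view can be sketched with k-means; the choice of k-means and of three clusters (to mirror the three species) is an assumption:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X = load_iris(as_frame=True).data

    # Centroids act as prototype instances that summarise each cluster
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(kmeans.cluster_centers_)   # one centroid per cluster
    print(kmeans.labels_[:10])       # cluster assignment of the first few instances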

Dimensionality Reduction

  • Embeddings
  • Projections
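
A linear projection onto two dimensions, sketched here with PCA (PCA is just one illustrative choice of projection):

    from sklearn.decomposition import PCA
    from sklearn.datasets import load_iris

    X = load_iris().data
    projection = PCA(n_components=2).fit_transform(X)   # 150 x 2 linear projection
    print(projection[:5])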

Dimensionality Reduction    

t-SNE embeddings

Dimensionality Reduction    

t-SNE embeddings
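
A non-linear embedding with t-SNE (Van der Maaten and Hinton 2008); the result depends strongly on parameters such as perplexity, which is one reason parameterisation may be tricky. The specific values below are arbitrary:

    from sklearn.manifold import TSNE
    from sklearn.datasets import load_iris

    X = load_iris().data

    # Different perplexities can yield visibly different embeddings
    for perplexity in (5, 30):
        embedding = TSNE(n_components=2, perplexity=perplexity,
                         random_state=42).fit_transform(X)
        print(perplexity, embedding.shape)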

Wrap Up

Summary

  • Explainability is relevant to data collection and processing
  • We usually have to make some modelling assumptions
  • Parameterisation may be tricky

Bibliography

Arnold, Matthew, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, et al. 2019. “FactSheets: Increasing Trust in AI Services Through Supplier’s Declarations of Conformity.” IBM Journal of Research and Development 63 (4/5): 6:1–13. https://doi.org/10.1147/JRD.2019.2942288.
Bender, Emily M, and Batya Friedman. 2018. “Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science.” Transactions of the Association for Computational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018) at the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden.
Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. “The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards.” arXiv Preprint arXiv:1805.03677.
Kelley, Patrick Gage, Joanna Bresee, Lorrie Faith Cranor, and Robert W. Reeder. 2009. “A ‘Nutrition Label’ for Privacy.” In Proceedings of the 5th Symposium on Usable Privacy and Security, 4:1–12. SOUPS ’09. New York, NY, USA: ACM. https://doi.org/10.1145/1572532.1572538.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model Cards for Model Reporting.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–29. ACM.
Reisman, Dillon, Jason Schultz, Kate Crawford, and Meredith Whittaker. 2018. “Algorithmic Impact Assessments: A Practical Framework for Public Agency Accountability.” AI Now Institute.
Sokol, Kacper, and Peter Flach. 2020. “Explainability Fact Sheets: A Framework for Systematic Assessment of Explainable Approaches.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 56–67.
Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (11).
Yang, Ke, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, HV Jagadish, and Gerome Miklau. 2018. “A Nutritional Label for Rankings.” In Proceedings of the 2018 International Conference on Management of Data, 1773–76. ACM.

Questions