Interpretability
Module: ethics
What it is
Interpretability means understanding what is happening inside an AI model. While explainability focuses on explaining a model's outputs, interpretability examines its internal representations: what concepts the model has learned, how information flows through its layers, and why particular neurons activate.
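As a concrete (if simplified) illustration, one common starting point is to read out a model's intermediate activations directly. The sketch below is an assumption-laden example rather than a standard recipe: it assumes PyTorch is available and uses a toy model, registering forward hooks so that each layer's output can be inspected after a forward pass.

```python
# A minimal sketch (assuming PyTorch) of one basic interpretability
# technique: capturing a model's internal activations with forward
# hooks so they can be inspected directly.
import torch
import torch.nn as nn

# A small hypothetical model standing in for any layered network.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

activations = {}

def save_activation(name):
    # Hook that records a layer's output each time the layer runs.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on every layer so we can see what each one computes.
for idx, layer in enumerate(model):
    layer.register_forward_hook(save_activation(f"layer_{idx}"))

# Run one example input; the hooks fill `activations` as a side effect.
x = torch.randn(1, 8)
model(x)

for name, act in activations.items():
    print(name, tuple(act.shape), "mean activation:", act.mean().item())
```

Looking at raw activations like this is only a first step; interpretability research builds on it with techniques such as probing, feature visualisation, and circuit analysis to work out what those internal values represent.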
Why it matters
Interpretability research helps us understand AI at a deeper level than explaining outputs alone. It can reveal unexpected behaviours, hidden biases, or concerning capabilities. As AI systems become more powerful, interpretability becomes increasingly important for verifying that they are working as intended.