Interpretability

Module: ethics

What it is

Interpretability is the practice of understanding what's happening inside an AI model. While explainability focuses on a model's outputs, interpretability examines its internal representations: what concepts the model has learned, how information flows through its layers, and why certain neurons activate.
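
For a concrete sense of what "looking inside" a model can mean, the sketch below records a hidden layer's activations using a forward hook, a common pattern in PyTorch. It is a minimal illustration only: the toy model, layer choice, and names are assumptions made for the example, not part of any particular interpretability method.

    # Minimal sketch: capturing a hidden layer's activations with a forward hook.
    # The model and layer names here are illustrative, not from any real system.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(8, 16),
        nn.ReLU(),
        nn.Linear(16, 4),
    )

    activations = {}

    def save_activation(name):
        # Returns a hook that stores the layer's output each time the model runs.
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    # Attach the hook to the hidden ReLU layer (index 1 in the Sequential model).
    model[1].register_forward_hook(save_activation("hidden_relu"))

    x = torch.randn(1, 8)   # a single dummy input
    _ = model(x)             # the forward pass populates `activations`

    # Inspect which hidden units fired, and how strongly.
    print(activations["hidden_relu"])

In real interpretability work the recorded activations would then be analysed, for example by comparing them across many inputs to see which concepts a unit responds to.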

Why it matters

Interpretability research helps us understand AI at a deeper level than explaining outputs alone. It can reveal unexpected behaviours, hidden biases, or concerning capabilities. As AI systems become more powerful, interpretability becomes increasingly important for ensuring they're working as intended.