No consensus has yet been reached in the literature on how to define interpretability.

Interpretability is often confused with the more abstract notions of fairness, privacy and transparency (Weller 2019). These terms lack a clear and unique definition, and their understanding may differ depending on the domain and context.

The words interpretable and explainable have been used interchangeably in some articles (Miller 2019; Lipton 2018), while others draw a strong distinction between the two terms (Rudin 2019). Undoubtedly, there is a link between the act of interpreting and that of explaining, as shown by the etymology of the words themselves. Interpretability has been presented as “explaining or presenting in understandable terms to a human”, as “providing explanations” to humans (Miller 2019), and as “assigning meaning to an explanation” (Palacio et al. 2021). For Rudin (2019), however, there is a strong distinction between interpreting and explaining, since models may be developed to directly encompass the ability to explain their decision-making. In this case, interpretability refers to meeting the transparency requirement at the task-definition level, whereas explanation refers to a post-hoc (after training) evaluation of the model’s understandability.

Integration is also difficult in other domains that are driving and shaping AI development. Policies for funding and regulating AI research also refer to concepts such as transparency, explicability, reliability, informed consent, accountability, and auditability of the systems (Bibal et al. 2020; Edwards and Veale 2017), but with different meanings.

A formal definition of interpretability exists in the field of mathematical logic; it can be informally summarized as the possibility of interpreting, or translating, one formal theory into another while preserving the validity of every theorem of the original theory during the translation (Tarski 1953). From this starting point, a definition of interpretable AI can be derived as follows:

“An AI system is interpretable if it is possible to translate its working principles and outcomes in human-understandable language without affecting the validity of the system”

This definition represents the shared goal that several technical approaches aim to achieve when applied to AI. In some cases, the definition is relaxed to include approximations of the AI system that maintain its validity as much as possible.
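The definition above can be made concrete with a minimal sketch: a small decision model is "translated" into a human-readable rule, and the translation is checked against the model on every test input, so validity is preserved. The loan-approval model, its weights, and the threshold are illustrative assumptions, not examples from the paper.

```python
# Hypothetical "model": a tiny loan-approval scorer (illustrative only).
def model(age, income):
    score = 0.03 * age + 0.00002 * income
    return score > 1.5

# Human-understandable translation of the same working principle.
RULE = "Approve if 0.03 * age + 0.00002 * income exceeds 1.5"

# The rule, re-executed literally, so the translation can be tested.
def rule_as_function(age, income):
    return 0.03 * age + 0.00002 * income > 1.5

# Validity check: the translation agrees with the model on every tested input.
cases = [(a, i) for a in range(18, 80, 5) for i in range(0, 200_000, 10_000)]
assert all(model(a, i) == rule_as_function(a, i) for a, i in cases)
```

For a model this simple the translation is exact; for larger systems the same check is typically relaxed to an approximation that agrees with the model as often as possible, as noted above.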

Note that AI interpretability can be sought by design, namely by incorporating the translation objective within the model objectives as an additional target of the system (built-in interpretability), or by post-hoc analyses that aim at improving the understandability of how the outcome was generated.
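The two routes can be contrasted in a short sketch, under assumed toy functions that are not from the paper: a built-in interpretable model exposes its working principles directly (here, readable linear weights), while a post-hoc analysis probes an opaque model from the outside after training (here, a crude one-at-a-time sensitivity analysis).

```python
# Built-in interpretability: a linear scorer whose weights ARE the explanation.
WEIGHTS = {"age": 0.03, "income": 0.00002}  # hypothetical weights

def linear_model(x):
    return sum(WEIGHTS[k] * x[k] for k in WEIGHTS)

# Post-hoc analysis: perturb one input at a time and measure the output
# change, applied after the fact to a model treated as a black box.
def posthoc_sensitivity(black_box, x, eps=1.0):
    base = black_box(x)
    return {k: abs(black_box({**x, k: x[k] + eps}) - base) for k in x}

x = {"age": 40, "income": 50_000}
print(WEIGHTS)                               # built-in: read the weights off
print(posthoc_sensitivity(linear_model, x))  # post-hoc: probe the model
```

On a linear model the two views coincide (the sensitivities recover the weights); the post-hoc route matters precisely when the model is opaque and its weights cannot simply be read off.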

An alternative definition of interpretability is given by Doshi-Velez and Kim as the act of “explaining or presenting in understandable terms to a human”. We will look at this definition in the next class.


Interpretability vs. explainability, transparency, etc.

The social sciences use different definitions of interpretability than the technical sciences.

A formal definition can be derived from mathematical logic

Interpretability can be sought by design or by post-hoc analyses


Graziani et al., 2022
A global taxonomy of interpretable AI: unifying the terminology for the technical and social sciences
