Introduction to Interpretable Machine Learning

By Seojin Bang

We have witnessed the rise of sophisticated, autonomous task-performing systems such as self-driving cars, product recommenders, language translators, and targeted advertising. These systems are driven by artificial intelligence (AI), in particular deep learning systems that benefit from massive datasets and high-performance infrastructure and have achieved near-human-level performance on some tasks. The problem is, we don’t really understand how AI works and why it makes the decisions it does.

What is a black box system?

We call a system a black box when:

  1. It’s so complicated that its logic is not comprehensible to humans, even when we know the exact form of the system. For example, a person who creates and trains a deep neural network knows what types of layers and activation functions are used, how errors are backpropagated, and what the weights are. However, they probably don’t know why the deep neural network produces its decisions.
  2. Its decision mechanism is unknown or secret to stakeholders. In this case, we only know the output of the system given a particular input, but not why and how it got there. For example, we still do not fully understand how human intelligence works. Another example might be when a company declines to reveal the decision mechanism of their case management software to their stakeholders because it is considered commercially sensitive.

Black boxes can be dangerous.

By relying on a decision from a black box system without knowing the logic behind it, we risk a lack of accountability. In May 2016, a report claimed that risk assessment software used by judges during criminal sentencing across the US was biased against black people. Northpointe, the company that provides the software, refused to disclose the software’s decision mechanism, which could have been subject to hidden biases or unexpected errors that led to incorrect assessments. The effects of black boxes can be life-changing when they are used in high-stakes settings where errors are unacceptable, such as medical decision-making, criminal justice, and self-driving cars.

With the existence of black box systems comes the natural need for their interpretability. Self-driving car companies should be able to provide explanations of how their cars are safe. When our online credit or loan application is denied by an autonomous decision system, we should be able to access an explanation of why and how that decision was made. In situations like these, cracking the black box would allow us to discover unexpected biases or errors and ways to improve a system. In fact, this need for interpretability has recently been made official: the European Union’s General Data Protection Regulation requires that autonomous decision-making systems that produce legal effects or other similarly significant effects on an individual must provide human-intelligible explanations.

What does it mean if a system “is interpretable”?

A system is interpretable if we, humans, can understand it. In other words, a system is interpretable if it can be explained in human-understandable terms.

Essentially, interpretability means having the ability to explain or present how a system works or why a system makes a decision in human-understandable terms.

What is interpretable machine learning?

Interpretable machine learning (IML) includes:

  1. A self-explaining, task-performing machine learning (ML) decision model that is constructed (learned) together with its own explanation.
  2. An ML model that is designed as a tool to explain another system that is separate from it.

Note that the target system is not restricted to machine learning systems; it can be anything, as long as it produces a quantified output for a given input.

How do you make a system interpretable?

So, what exactly do we have to do to understand a system? There are various ways depending on the purpose of the desired explanation.

Figure 1 is an example of an existing approach, the Saliency Map. The highlighted pixels show the attribution of each pixel of the input image, that is, how strongly it leads the CNN image classification model to its decision that, e.g., the dog image should be labeled “dog.”

Figure 1. Explanation provided by Simonyan et al. (2013, Saliency Map). The Saliency Map for each image is provided right below the image.
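
To make the idea of pixel attribution concrete, here is a minimal sketch of a gradient-style saliency map. It is not the implementation of Simonyan et al.: the "classifier" is a hypothetical logistic score over a flattened 2 x 2 image, the names class_score and saliency_map are ours, and the gradient is approximated by central differences rather than computed by backpropagation through a CNN.

```python
import math


def class_score(pixels, weights, bias=0.0):
    """Toy differentiable 'classifier': a logistic score over a flattened image."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def saliency_map(score_fn, pixels, eps=1e-4):
    """Approximate |d score / d pixel| for every pixel by central differences."""
    saliency = []
    for i in range(len(pixels)):
        up = list(pixels)
        down = list(pixels)
        up[i] += eps
        down[i] -= eps
        grad = (score_fn(up) - score_fn(down)) / (2 * eps)
        saliency.append(abs(grad))
    return saliency


# Hypothetical 2x2 "image", flattened row by row, and made-up model weights.
image = [0.2, 0.9, 0.1, 0.5]
weights = [0.0, 2.0, -1.0, 0.5]

sal = saliency_map(lambda px: class_score(px, weights), image)
most_salient = max(range(len(sal)), key=sal.__getitem__)
print("saliency per pixel:", sal)
print("most salient pixel index:", most_salient)
```

Pixels whose small changes move the class score the most receive the largest saliency values; a pixel the model ignores (weight 0 here) receives none.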

Figure 2 shows an example provided by LIME of interpreting image classification by Google’s Inception neural network. The colored pixels explain why the given image is classified as “Electric guitar,” “Acoustic guitar,” or “Labrador.” Although the explanations from LIME and Saliency are similar in that they provide you with the attributing pixels that lead to each decision, LIME has an advantage over Saliency: it is system-agnostic and can be applied to any type of black box system, even when we do not know the system’s exact form.

Figure 2. Explanation provided by Ribeiro et al. (2016, LIME)
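
The system-agnostic idea can be sketched in a few lines: switch features of one instance on and off, query the black box on each perturbed copy, and fit a proximity-weighted linear model to its answers. The sketch below is a simplification in the spirit of LIME, not the LIME library or the authors' code. It assumes a hypothetical toy_black_box, uses a small ridge penalty where LIME selects a sparse set of features, and the kernel width and sample count are arbitrary choices of ours.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def local_surrogate(black_box, instance, n_samples=500,
                    kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a proximity-weighted linear model on on/off masks of the features."""
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, sample_weights = [], [], []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(d)]
        perturbed = [v if keep else 0.0 for v, keep in zip(instance, mask)]
        # Distance to the original instance: fraction of features switched off.
        distance = 1.0 - sum(mask) / d
        rows.append(mask + [1])  # last column is the intercept
        targets.append(black_box(perturbed))  # only inputs and outputs are used
        sample_weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    k = d + 1
    # Weighted ridge normal equations: (X^T W X + ridge * I) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, sample_weights))
             + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, sample_weights, targets))
            for i in range(k)]
    coef = solve_linear_system(xtwx, xtwy)
    return coef[:d]  # per-feature attributions, intercept dropped


def toy_black_box(x):
    """Hypothetical black box: the explainer never looks inside this function."""
    return 3.0 * x[0] - 2.0 * x[2]


attributions = local_surrogate(toy_black_box, [1.0, 1.0, 1.0, 1.0])
print("attributions:", [round(a, 3) for a in attributions])
```

Because the explainer only calls black_box on inputs, the same code would work for any system that returns a number for an input, which is what "system-agnostic" means here.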

L2X, another system-agnostic approach like LIME, provides instance-specific keywords to explain a CNN sentiment classification system for text-based movie review data from the Large Movie Review Dataset (IMDB). In Figure 3, selected keywords such as “love,” “winning,” “captivating,” and “wonderful” explain the model’s positive prediction for one movie review, while keywords such as “unrealistic,” “boring,” “tired,” and “sorry” are selected for a review predicted as negative.

Figure 3. Explanation provided by Chen et al. (2018, L2X). The selected key words are highlighted in yellow.

L2X also provides instance-specific key patches of 4 × 4 pixels to explain a CNN classification system trained on the MNIST handwritten digit dataset (restricted to the digits “3” and “8”). Figure 4 shows how the system recognizes the difference between 3 and 8.

Figure 4. Explanation provided by Chen et al. (2018, L2X). The selected patches are colored red if the pixel is activated (white) and blue otherwise.

ACD provides a hierarchy of meaningful phrases and an importance score for each identified phrase (blue for positive, red for negative) for an LSTM sentiment classification system trained on the Stanford Sentiment Treebank. In Figure 5, this hierarchy is used to explain why the system made an incorrect prediction. ACD correctly captured the sentiment of the positive phrase “a great ensemble cast,” and the negative phrase “n’t lift this heartfelt enterprise out of the familiar.” However, when the two phrases were combined, the system learned a positive sentiment and incorrectly predicted the negative sentence as positive.

Figure 5. Explanation provided by Singh et al. (2018, ACD). Blue is positive sentiment, white is neutral, red is negative. The bottom row displays scores for individual words in the sentence. Higher rows display important phrases identified by ACD, along with their scores, converging to the model’s (incorrect) prediction in the top row.

TCAV explains a neural net’s internal state in terms of human-friendly concepts. In Figure 6, the images (pictures of stripes and pictures of people wearing neckties) are sorted by their relation to the concept (in these cases, the concept of “CEO” and “Model Women,” respectively). The concept “CEO” that was learned by the neural network is explained by the concept-related stripe images, and the concept “Model Women” that was learned by the neural network is explained by the concept-related necktie images.

Figure 6. Explanation provided by Kim et al. (2017, TCAV). The most and least similar pictures of stripes using the “CEO” concept (left) and of neckties using the “model women” concept (right).

How do you evaluate interpretable machine learning approaches?

While the choice of evaluation varies depending on the specificity of the claim being made, there are common desiderata of an interpretable model.

Interpretability: to what extent the interpretations are comprehensible to humans.

For example, Chen et al. (2018) from the L2X example above asked humans on Amazon Mechanical Turk (AMT) to infer an output (sentiment) given an explanation consisting of instance-specific keywords (or sentences) used by a sentiment classification system for text-based movie reviews.

Singh et al. (2018) from the ACD example above asked eleven UC Berkeley graduate students with some level of machine learning knowledge two types of questions: which model they “think” has higher predictive accuracy, and how much they “trust” a model’s output.

Accuracy: how well the IML model performs its task when it works as both a task-performing model and a tool to explain itself. It is measured by traditional prediction performance metrics such as accuracy, F1 score, and AUC.

Fidelity: how accurately the IML model mimics the behavior of a black box system. The model’s (locally or globally) complete description of itself is used to evaluate its approximation of the black box system. It is informed by traditional prediction performance measures with respect to the outcome of the black box system.
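
The difference between the two measures is only the reference they are scored against: accuracy compares the IML model with the ground truth, while fidelity compares it with the black box. A minimal sketch, using a plain agreement rate and made-up label lists (all names and values here are illustrative, not results from any of the papers above):

```python
def agreement(predictions, references):
    """Fraction of positions where two label sequences agree."""
    if len(predictions) != len(references):
        raise ValueError("sequences must have the same length")
    if not predictions:
        raise ValueError("sequences must not be empty")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


# Hypothetical labels for eight inputs.
true_labels = [1, 0, 1, 1, 0, 1, 0, 0]      # ground truth
black_box_preds = [1, 0, 1, 0, 0, 1, 1, 0]  # what the black box predicts
iml_preds = [1, 0, 1, 0, 0, 1, 1, 1]        # what the IML model predicts

accuracy = agreement(iml_preds, true_labels)      # scored against the truth
fidelity = agreement(iml_preds, black_box_preds)  # scored against the black box

print("accuracy:", accuracy)  # 0.625
print("fidelity:", fidelity)  # 0.875
```

In this toy case the IML model reproduces the black box's mistakes, so it has high fidelity but lower accuracy. Any other performance metric (F1, AUC) can be substituted for the agreement rate in the same way.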

How we do it at Petuum.

An Explainable Artificial Intelligence Solution for Medical Patient Diagnosis

We have built an Explainable Artificial Intelligence (XAI) solution to improve the accurate and timely generation of medical patient diagnoses. It retrieves highly interpretable and referenced prediction sources, saving physicians significant time and decreasing medical errors.

Interpretable Machine Learning Approach using Deep Variational Information Bottleneck Principle

We have developed a novel system-agnostic interpretable machine learning approach using the deep variational information bottleneck principle. It extracts key cognitive features that are (1) maximally compressed with respect to an input and (2) informative about the decision a black box system makes on that input. These features act as an information bottleneck and are called the minimal sufficient statistic.

Figure 7. Illustration of our approach.
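
For readers unfamiliar with the principle, the classical information bottleneck objective can be written as below. This is the generic textbook form, shown only for orientation; it is not necessarily the exact objective optimized in our approach. The symbols are notation assumed here: x is the input, y is the black box decision, t is the set of extracted features, I(·;·) is mutual information, and β is a trade-off weight.

```latex
\max_{p(t \mid x)} \; I(t; y) \;-\; \beta \, I(x; t)
```

The first term rewards features that are informative about the decision, and the second penalizes information retained about the input, which is what "compressed" means in the description above.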

Here is an example of an explanation provided by our approach. It selects five keywords (in red) that convey the most compressed yet sufficient information about the predicted sentiment (positive) for an IMDB movie review.

Figure 8. Explanation provided by our approach.

With the rise of sophisticated, autonomous AI decision systems in vital applications, the need for interpretability has received much attention in AI communities. IML aims to crack a black box system and provide an explanation of how a system works or why a system makes a decision in human-understandable terms. Most of the approaches we’ve introduced in this post provide simple explanations of why a particular black box decision is reached via feature attribution. It will be necessary to develop more sophisticated IML approaches to explain a specific query in more human-friendly terms.

Further Reading

Here are recommended survey papers about IML:

  • Lipton, Zachary C. “The mythos of model interpretability.” Queue 16.3 (2018): 30.
  • Doshi-Velez, Finale, and Been Kim. “Towards a rigorous science of interpretable machine learning.” arXiv preprint arXiv:1702.08608 (2017).
  • Guidotti, Riccardo, et al. “A survey of methods for explaining black box models.” ACM Computing Surveys (CSUR) 51.5 (2018): 93.

And here are the papers mentioned in this post:

  • [ACD] Singh, Chandan, W. James Murdoch, and Bin Yu. “Hierarchical interpretations for neural network predictions.” arXiv preprint arXiv:1806.05337 (2018).
  • [LIME] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Why should I trust you?: Explaining the predictions of any classifier.” Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 2016.
  • [L2X] Chen, Jianbo, et al. “Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.” arXiv preprint arXiv:1802.07814 (2018).
  • [Saliency Map] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. “Deep inside convolutional networks: Visualising image classification models and saliency maps.” arXiv preprint arXiv:1312.6034 (2013).
  • [TCAV] Kim, Been, et al. “TCAV: Relative concept importance testing with linear concept activation vectors.” arXiv preprint arXiv:1711.11279 (2017).