Machine learning (ML) explainability is a hot topic these days, particularly in the era of foundation models (FMs)/large language models (LLMs). The increasing emphasis on ML explainability reflects a broader recognition of the need to align AI technologies with human values, ethical principles, and societal norms. As AI systems become more embedded in everyday life, ensuring they operate transparently and justifiably becomes more of a societal imperative. This trend is likely to continue, with explainability playing a key role in the evolution of AI technologies and their integration into diverse facets of human activity.
In today’s Token, we will cover:
Why should I bother with Explainability? Trust. Debugging. Ethical Considerations and Bias. Interdisciplinary Collaboration. Regulations and Compliance.
How can I explain my models? Surrogate Models. Attention Visualization. Natural Language Explanation. Prompt Based Techniques.
How can I implement these techniques? BertViz. Captum. LIME. SHAP.
Conclusion.
In this article, you will learn why you need to focus on model explainability, followed by useful techniques and tools that will help you decipher these models and their predictions.
Why Explainability Matters for LLMs
You could be investing this time in improving the performance metrics for your model, optimizing your model so that you can serve your predictions faster, or just talking about your favorite Marvel character with your colleague. But, why on Earth should you be reading about explainability when your ML model is already, let’s say, 98% accurate? Actually, there are a few strong reasons for you to focus on explainability. Let us go through them.
Trust
Machine learning models, especially LLMs, are extremely complex and hence opaque to users. Unlike simpler models where decisions can be traced back to specific features or rules, LLMs work through intricate networks of neurons and weights, making their decision-making process a black box*. But AI and ML models are becoming ubiquitous in our lives, and for them to be fully embraced by businesses and the public, people need to trust them. Explainability bridges the gap between human understanding and machine logic, making these systems more approachable and trustworthy. When users understand how a model arrives at its conclusions, they're more likely to trust its decisions and, by extension, the organizations that deploy it. Explainable AI thus helps users understand how a model works and the logic behind its decisions, which in turn encourages adoption.
*A black box in AI refers to a system whose inner workings are opaque, making its decision-making process unclear and unexplained.
Debugging
LLMs are complex and can sometimes produce unexpected or erroneous outputs. Debugging LLMs can be challenging, especially when you have no idea how they arrived at a particular conclusion or output. When you discover why the model reached a particular conclusion using explainable AI tools, it becomes easier for you to tweak the dataset, the training procedure, or to fine-tune it so that the model's outputs are closer to the desired outcome. Furthermore, explainable AI tools can help you examine the source of bias in the model. Once identified, you can then take steps to eliminate those biases.
Ethical Considerations and Bias
As ML models, especially LLMs, are deployed in more critical applications, from healthcare to legal and financial services, the potential for biased outcomes or ethical mishaps increases. There's a growing recognition that these models can inadvertently perpetuate or even exacerbate biases present in their training data. Explainability allows stakeholders to identify and mitigate these biases, helping AI systems act in an ethical and fair manner.
Interdisciplinary Collaboration
Explainability isn't just about mitigating risks; it's also about unlocking new potential. The push for explainability fosters collaboration across fields, combining the expertise of data scientists, ethicists, legal experts, and domain specialists. This interdisciplinary approach enriches the development and deployment of AI systems, ensuring they meet a wider range of needs and considerations and encouraging innovative approaches.
Regulations and Compliance
Lastly, the power of LLMs has made them central to automation systems in critical industries such as healthcare and medicine, journalism, banking, and insurance. Regulations in certain parts of the world (for example, GDPR in Europe) require, or may soon require, explainability to establish transparency, fairness, and accountability.
Thus, the explainability of ML models, especially LLMs, is crucial for the wide adoption and smooth functioning of these models in production.
Explainability Techniques for LLMs
While explainable AI is a fairly old topic, it is still an area of active research. And, with the rise of LLMs and the ethical concerns associated with them, its popularity has skyrocketed. Conventional models like linear regression and tree-based models are far easier to explain than LLMs. For instance, in a linear model, the coefficient of each feature indicates its significance and helps explain the model output.

Image Credit: Zelros AI
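To make the linear case concrete, here is a minimal sketch of how coefficients explain a single prediction. The model, its coefficients, and the feature names are hypothetical values picked for illustration, not the output of any real training run.

```python
# Hypothetical, hand-picked coefficients of an already-trained linear model
# that predicts a house price (in thousands). The numbers are illustrative.
INTERCEPT = 50.0
COEFFICIENTS = {"area_sqm": 2.0, "bedrooms": 10.0, "age_years": -0.5}


def explain_linear_prediction(features):
    """Return the prediction and each feature's additive contribution."""
    contributions = {
        name: COEFFICIENTS[name] * value for name, value in features.items()
    }
    prediction = INTERCEPT + sum(contributions.values())
    return prediction, contributions


house = {"area_sqm": 100.0, "bedrooms": 3.0, "age_years": 20.0}
prediction, contributions = explain_linear_prediction(house)

print(f"prediction: {prediction}")
# List features from the largest to the smallest absolute contribution.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>10}: {value:+.1f}")
```

Because the prediction is just the intercept plus the sum of the contributions, every unit of the output can be attributed to a specific feature. Keep in mind that raw coefficients are only comparable across features when the features are on similar scales.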
For tree-based models like decision trees, random forests, XGBoost, etc., you can compute feature importance, which gives you an idea of how important each feature is to the trained model. It is typically computed from the drop in error/loss achieved when the model splits on that feature.
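The "drop in error" idea can be sketched with a single split. The example below is a deliberately simplified, one-split (decision stump) version: for each feature it finds the threshold that reduces the sum of squared errors the most and normalizes those gains into importances. Real tree libraries accumulate such gains over every split in every tree; the data and feature names here are made up.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_gain(column, targets):
    """Largest drop in SSE obtainable by splitting on one feature column."""
    parent = sse(targets)
    best = 0.0
    # Try every threshold that leaves at least one sample on each side.
    for threshold in sorted(set(column))[:-1]:
        left = [t for x, t in zip(column, targets) if x <= threshold]
        right = [t for x, t in zip(column, targets) if x > threshold]
        best = max(best, parent - (sse(left) + sse(right)))
    return best


# Toy data: the target follows "size" closely and is unrelated to "noise".
rows = [
    {"size": 1, "noise": 5},
    {"size": 2, "noise": 1},
    {"size": 3, "noise": 4},
    {"size": 4, "noise": 2},
]
targets = [10.0, 10.0, 20.0, 20.0]

gains = {
    name: best_split_gain([row[name] for row in rows], targets)
    for name in rows[0]
}
total = sum(gains.values())
importance = {name: gain / total for name, gain in gains.items()}

for name, score in importance.items():
    print(f"{name:>6}: {score:.2f}")
```

Note that the unrelated feature still receives a non-zero score on such a tiny sample, which is one reason importances are best read as relative rankings rather than exact measurements.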
Meanwhile, for LLMs, explaining the model is not so straightforward. Below, we will discuss a few explainability techniques that can be employed for LLMs. Worth knowing!
Surrogate Models
Surrogate models use simpler, easily interpretable models to approximate and explain the predictions made by complex black box models. Surrogate models may include models like decision trees, linear models, or any other white-box model. This technique is used by LIME to generate explanations for a specific instance. To generate explanations for an instance, new data samples are created by applying minor changes to the original instance. The surrogate model is then trained on those samples to mimic the behavior of the original black-box model. Now, the trained surrogate model, which is explainable, can be used as a substitute for the original complex model to explain its predictions for the specific instance.
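The perturb-then-fit loop described above can be sketched in plain Python. This is a LIME-style illustration under stated assumptions, not the LIME library itself: `black_box` is a hypothetical stand-in for an opaque sentiment model, the perturbations enumerate every keep/drop combination of words instead of sampling at random, and the proximity kernel is a hand-picked choice.

```python
from itertools import product
import math


def black_box(words):
    """Hypothetical opaque sentiment model: returns P(positive)."""
    score = 0.5
    if "great" in words:
        score += 0.3
    if "boring" in words:
        score -= 0.4
    return score


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain(words):
    """Fit a proximity-weighted linear surrogate around one instance."""
    d = len(words)
    rows, targets, weights = [], [], []
    # Perturb the instance: every way of keeping (1) or dropping (0) a word.
    for mask in product([0, 1], repeat=d):
        kept = [w for w, keep in zip(words, mask) if keep]
        rows.append([1.0] + [float(bit) for bit in mask])  # leading 1.0 = intercept
        targets.append(black_box(kept))
        # Samples closer to the original (fewer words dropped) count more.
        weights.append(math.exp(-(d - sum(mask)) / d))
    # Weighted least squares via normal equations: (X^T W X) beta = X^T W y.
    k = d + 1
    xtwx = [
        [sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(w * r[i] * t for r, w, t in zip(rows, weights, targets))
        for i in range(k)
    ]
    beta = solve(xtwx, xtwy)
    return dict(zip(words, beta[1:]))  # drop the intercept, keep word weights


sentence = ["a", "great", "but", "boring", "film"]
word_weights = explain(sentence)
for word, weight in sorted(word_weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:>7}: {weight:+.2f}")
```

The surrogate's coefficients are the explanation: words with large positive or negative weights are the ones driving the black box's prediction for this particular sentence.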
Attention Visualization
Attention values help you identify which parts of the input prompt the model weighs most heavily. Visualizing these values as a heat map or bipartite graph reveals the attention patterns, which in turn helps you understand how the model works, i.e., how it treats different words in the input.

In the image above, when processing the words “the” and “cat” in the second phrase (after the first separator), you can see that the model is still paying attention to the corresponding words in the first phrase. This indicates that the model knows that the second phrase is describing the same cat encountered in the first phrase.
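If you want a feel for what such a visualization is built from, the sketch below computes scaled dot-product attention weights for three tokens and renders them as a text heat map. The token vectors are made-up toy values; with a real model you would read the attention matrices from the model's outputs instead of computing them by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention: one row of weights per query token."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def render_heatmap(tokens, weights):
    """Render the weight matrix as text; denser glyphs mean more attention."""
    shades = " .:*#"
    lines = []
    for token, row in zip(tokens, weights):
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        lines.append(f"{token:>5} |{cells}|")
    return "\n".join(lines)


tokens = ["the", "cat", "sat"]
# Toy 2-dimensional vectors used as both queries and keys (an assumption).
vectors = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors)
print(render_heatmap(tokens, weights))
```

Each row shows how one token distributes its attention over all tokens, so every row sums to 1. Tools like BertViz draw the same kind of matrix, per layer and per head, in an interactive form.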
Natural Language Explanation
This approach involves training the language model with both the data and explanations. The trained model can then automatically generate explanations in natural language. Since these explanations provide additional context, this can have the side effect of improving model accuracy. An example dataset for training the model with natural language explanations is provided below.
| Text | Label | Explanation |
|---|---|---|
| "This movie was amazing! I laughed all the way through." | Positive | The words "amazing" and "laughed" indicate positive emotions and experience. |
| "The service was terrible, and the food was cold." | Negative | "Terrible" and "cold" directly express negative aspects, further emphasized by "and". |
| "The book was interesting, but the ending was confusing." | Mixed | Positive adjective "interesting", but "confusing" suggests negativity about the ending. |
| "The product worked well, but it arrived damaged." | Mixed | "Worked well" indicates positive functionality, but "damaged" implies a negative experience. |
| "The instructions were unclear, and it took me hours to assemble the furniture." | Negative | "Unclear" and "hours" express negative experience directly. |
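One possible way to turn rows like these into training examples is sketched below. The prompt template and the `prompt`/`completion` field names are assumptions made for illustration; use whatever format your fine-tuning setup expects.

```python
import json

# Three rows taken from the table above.
examples = [
    {
        "text": "This movie was amazing! I laughed all the way through.",
        "label": "Positive",
        "explanation": 'The words "amazing" and "laughed" indicate positive '
                       "emotions and experience.",
    },
    {
        "text": "The service was terrible, and the food was cold.",
        "label": "Negative",
        "explanation": '"Terrible" and "cold" directly express negative aspects.',
    },
    {
        "text": "The product worked well, but it arrived damaged.",
        "label": "Mixed",
        "explanation": '"Worked well" indicates positive functionality, but '
                       '"damaged" implies a negative experience.',
    },
]


def to_training_record(example):
    """Pair the input text with a target containing label AND explanation."""
    prompt = f"Text: {example['text']}\nSentiment and explanation:"
    completion = f" {example['label']}. {example['explanation']}"
    return {"prompt": prompt, "completion": completion}


# One JSON object per line (JSONL), a common format for fine-tuning data.
jsonl = "\n".join(json.dumps(to_training_record(e)) for e in examples)
print(jsonl)
```

Because the target text contains both the label and its justification, a model trained on such records learns to emit an explanation alongside every prediction.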
Prompt Based Techniques
You can also ask LLMs to explain their own predictions. You can design a prompt and ask the model to justify its predictions or to provide evidence. The image below demonstrates this technique.

Additionally, you can try contrastive prompting to peek into an LLM’s internal knowledge base and its understanding of concepts. Contrastive prompting involves providing paired prompts and responses that contrast with each other and recording the LLM’s response to each of them.
Here are a couple of contrastive prompt examples:
Text Classification:
Prompt: "Classify this text as positive or negative sentiment. A positive example is 'I love sunny days.' A negative example is 'I hate rainy weather.'"
Code Generation:
Prompt: "Write a Python function to calculate the Fibonacci sequence efficiently, similar to using memoization. Do not use a naive recursive approach that leads to excessive computations."
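If you want to run such probes systematically, a small harness along these lines may help. It is a sketch under stated assumptions: `ask_llm` is a hypothetical callable that wraps whatever LLM client you use, and `fake_llm` is a keyword-based stand-in so the example runs without a model.

```python
def contrastive_probe(ask_llm, prompt_pairs):
    """Send each pair of contrasting prompts and record both responses."""
    records = []
    for prompt, contrast in prompt_pairs:
        response = ask_llm(prompt)
        contrast_response = ask_llm(contrast)
        records.append({
            "prompt": prompt,
            "contrast": contrast,
            "response": response,
            "contrast_response": contrast_response,
            # True when the edit between the two prompts flipped the answer.
            "changed": response != contrast_response,
        })
    return records


def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call; replace with your client."""
    return "negative" if "hate" in prompt.lower() else "positive"


pairs = [
    # The sentiment-bearing word changes, so the answer should flip.
    ("Classify the sentiment: 'I love sunny days.'",
     "Classify the sentiment: 'I hate sunny days.'"),
    # Only a neutral word changes, so the answer should stay the same.
    ("Classify the sentiment: 'I love sunny days.'",
     "Classify the sentiment: 'I love sunny mornings.'"),
]

records = contrastive_probe(fake_llm, pairs)
for record in records:
    print(record["changed"], "|", record["contrast"])
```

Comparing which edits flip the response and which do not gives you a rough picture of the concepts the model is actually relying on.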
To explore more explainable AI techniques, you can also refer to this survey paper.
Tools for Explaining LLMs (BertViz, Captum, LIME, SHAP)
Some of the above techniques rely only on the input prompt and hence are easy to implement. For other techniques, you can either write your own code or use existing tools to achieve desired results. Since writing your own custom solution to implement explainable AI involves re-inventing the wheel and is outside the scope of this blog, we won’t dwell on it. In this section, we will discuss existing tools that help us explain LLMs.
BertViz
BertViz is an interactive tool that allows you to explore attention in the network, visualize the inner workings of language models like BERT, T5, GPT, etc., and gain valuable insights. You can run this tool in notebooks (Jupyter or Google Colab) or as a Python script. It works seamlessly with Hugging Face models.
There are three major views offered in BertViz:
Head View: Allows you to visualize attention in one or more attention heads in the same layer.
Model View: Allows you to inspect attention across all layers and heads.
Neuron View: Allows you to visualize individual neurons and shows how they are used to compute attention.

Image Credit: BertViz GitHub
Captum
Captum is an open-source interpretability tool designed for PyTorch. Captum helps users implement interpretability algorithms that can interact with PyTorch models. An advantage of using Captum is its rapid development. It recently introduced LLM attribution functionality, which makes it easy to apply attribution algorithms to explain LLM text generation. This link contains an example (with code) for the same.
LIME
LIME (Local Interpretable Model-Agnostic Explanations) is a model-agnostic tool for interpretability. At the moment, it can explain any black box classifier with two or more classes. It is based on the idea of surrogate models and is used to explain individual predictions rather than the model itself. LIME builds local linear models around specific predictions, providing an easily interpretable explanation for individual instances.
SHAP
SHAP stands for SHapley Additive exPlanations. Inspired by game theory, it fairly distributes credit among features, explaining individual predictions and revealing which factors truly drive the model's output. Like LIME, SHAP is also a model-agnostic tool and hence can be used to explain predictions of any model.
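The game-theoretic idea behind SHAP can be illustrated by computing exact Shapley values for a toy model. This is a from-scratch sketch, not the SHAP library: `model_output` and its feature names are hypothetical, and a feature is treated as simply "present" or "absent". Exact computation enumerates every ordering, which is only feasible for a handful of features, so practical tools rely on approximations.

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Average each player's marginal contribution over all orderings."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for player in order:
            before = value_fn(frozenset(coalition))
            coalition.add(player)
            totals[player] += value_fn(frozenset(coalition)) - before
    return {player: total / len(orderings) for player, total in totals.items()}


def model_output(present):
    """Toy model score given the set of features that are 'present'."""
    score = 0.0
    if "income" in present:
        score += 40.0
    if "age" in present:
        score += 10.0
    if "income" in present and "savings" in present:
        score += 20.0  # interaction: savings only matters alongside income
    return score


features = ["income", "age", "savings"]
values = shapley_values(model_output, features)
for name, value in values.items():
    print(f"{name:>8}: {value:+.1f}")
```

The credit for the income-savings interaction is split evenly between the two features involved, and the values add up to the difference between the full model output and the empty baseline. That additivity is what makes the attribution "fair" in the game-theoretic sense.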
Conclusion
Model metrics aren’t everything for LLMs. While good metrics are desirable, it’s equally important to focus on the explainability of the models. Neglecting explainability can lead to user churn, lawsuits, and a decline in the adoption or usage of your models. Moreover, explainable AI can aid in debugging your models by offering insights into their knowledge base and reasoning processes. It’s also a cornerstone for fostering trust, innovation, and practical applicability.
Surrogate models, attention visualizations, and natural language explanations offer tangible pathways to demystify the complex decision-making processes of LLMs, making them more accessible and understandable to users and developers alike. Techniques like contrastive prompting further enhance our ability to probe these models, revealing the underlying logic and biases in their responses. The advent of tools like BertViz, Captum, LIME, and SHAP has significantly lowered the barriers to implementing explainable AI, empowering users to not only interpret but also improve the reliability and fairness of LLMs.
As we continue to integrate LLMs into various facets of society, the pursuit of explainability will undoubtedly play a pivotal role in shaping their future development, ensuring they serve the greater good while minimizing unintended consequences.