How to Monitor LLMs: KPIs, Tools, and Adversarial Risks

Why LLM Monitoring Matters

Peter F. Drucker, author of the famous book “The Effective Executive”, is widely credited with the phrase “What gets measured, gets improved”. This saying is as popular in the Machine Learning (ML) world as it is in the corporate world. For instance, before you can decide to work on a model's accuracy, you first need to know that its accuracy is insufficient. And to know that, you need to measure it.

To improve the models, we need to gauge them across numerous facets. The monitoring metrics depend on the task at hand and overlap with those of conventional ML models. Depending on the task, you could still use metrics like the F1 score, accuracy, and precision to gauge the performance of LLMs but, in addition to these metrics, you will also need to take care of:

Safety measures: Filtering outputs so that the model does not produce biased or contradictory content
Protection from adversarial attacks
Interpretability

Failing to monitor LLMs could result in a tarnished reputation and might cause irrevocable damage to both the company using it and the company that made it. So, what should you know about monitoring large language (and traditional) models?

In today’s Token, we cover:

Things can get nasty really quickly with LLMs. How can I start monitoring my models and infrastructure?
A curated list of open-source tools that solve some of the most pressing problems with LLM monitoring and observability
What would be the right KPIs to measure?
My model metrics look good, but the model is still not performant. What might be the issue?
How do I know my users are actually benefitting from the model and improved metrics?
Adversarial attacks 😱
Conclusion

LLM Monitoring Tools (Open-Source)

In the ML world, new tools emerge every week, and not all of them will be useful for your use case. Below, we have curated a list of open-source tools that solve some of the most pressing problems with LLM monitoring and observability:

AllenNLP Interpret:
- A library for interpreting and visualizing the predictions of NLP models, assisting in model explanation and debugging. It is built on AllenNLP and provides its interpretation methods for models implemented in that framework.
LangKit:
- An open-source toolkit for monitoring Large Language Models (LLMs). Features include text quality and relevance assessment, hallucination checks, and sentiment and toxicity analysis.
BERTViz:
- A tool for visualizing attention in Transformer-based NLP models (BERT, GPT2, BART, etc.), which helps you interpret what the model attends to.
SHAP (SHapley Additive exPlanations):
- A game-theoretic approach to explaining the output of any machine learning model. It works with models from the transformers library by Hugging Face.
AI Fairness 360:
- An extensible open-source toolkit that can help you examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle.
Prometheus:
- An open-source, general-purpose monitoring toolkit that can collect and query metrics from LLM services in real time.
Grafana:
- Integrates with tools like Prometheus and Elasticsearch to provide visualization and analysis of LLM metrics and logs.

KPIs and Metrics for Monitoring LLMs

Now that you have a good grasp of the tools you could use to monitor and observe your models, what would be the right KPIs to measure? We have prepared a categorized list of KPIs along with their purpose.

Bias and Fairness

Demographic Parity: Measures whether the LLM's output differs unfairly based on demographic attributes (age, sex, gender, etc.).
Expert Review: Asking an expert to identify bias in the LLM’s output. It is a manual process.
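To make demographic parity concrete, here is a minimal sketch in plain Python. It assumes you have already labeled each LLM output as favorable or not and tagged it with a demographic group; the group names and records below are made up for illustration, and the "gap" summary is just one simple way to report the result.

```python
from collections import defaultdict

def positive_rate_by_group(records):
    """records: iterable of (group, favorable) pairs.
    Returns {group: share of outputs judged favorable}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, favorable in records:
        totals[group] += 1
        positives[group] += int(favorable)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(records):
    """Largest difference in favorable-outcome rate between any two groups."""
    rates = positive_rate_by_group(records)
    return max(rates.values()) - min(rates.values())

# Hypothetical evaluation records: (demographic group, was the output favorable?)
sample = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = positive_rate_by_group(sample)
gap = demographic_parity_gap(sample)
print(rates, gap)
```

A gap close to zero suggests the groups receive favorable outputs at similar rates; what counts as an acceptable gap depends on your application.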

Task-Specific Performance

Depending on the downstream task you intend to use the LLM for, you could use any suitable metric, such as accuracy, F1 score, precision, and recall. LLMs share these metrics with conventional ML models.
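As a quick refresher, these metrics can be computed from true and predicted labels as in the sketch below. It assumes a binary task where the label 1 is the positive class; the labels are invented for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(precision, recall, f1, acc)
```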

Language Fluency and Coherence

Perplexity: Measures how uncertain the LLM is when predicting the next token, which serves as a proxy for fluency. Lower perplexity signifies better language modeling.
BLEU Score: It stands for Bilingual Evaluation Understudy. It measures how closely the output overlaps with human-written reference text at the n-gram level. A higher score is better.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Similar to BLEU, but recall-oriented: it measures how much of the reference text's main ideas and information the LLM output captures.
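The sketch below shows the idea behind these metrics in plain Python. It assumes you can obtain per-token natural-log probabilities from your model, and it uses a deliberately simplified unigram overlap: real BLEU combines several n-gram orders with a brevity penalty, and ROUGE comes in several variants, so treat this as an illustration rather than a replacement for an established implementation.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """exp of the negative mean log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def unigram_overlap(candidate, reference):
    """Returns (precision, recall) of clipped unigram overlap.
    Precision is the BLEU-style view, recall the ROUGE-1-style view."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand or not ref:
        return 0.0, 0.0
    overlap = sum((cand & ref).values())  # each word counted at most as often as in both
    return overlap / sum(cand.values()), overlap / sum(ref.values())

# A model that assigns probability 0.25 to every token has perplexity 4
ppl = perplexity([math.log(0.25)] * 4)

# Made-up candidate and reference sentences
p, r = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(ppl, p, r)
```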

Apart from model-specific metrics, you will also need to monitor resource metrics to ensure the smooth delivery of ML solutions. These metrics are the same as you would measure when deploying conventional ML models. Examples include memory usage, errors and logs, CPU usage, disk usage, network bandwidth, etc.
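As a rough illustration, the snippet below records latency and error counts around a model call and reads disk usage, using only the Python standard library. `fake_generate` is a hypothetical stand-in for your real inference call, and in production you would normally export such numbers to a monitoring system such as Prometheus instead of keeping them in memory.

```python
import shutil
import time

class CallStats:
    """Keeps latencies (in seconds) and an error count for wrapped calls."""
    def __init__(self):
        self.latencies = []
        self.errors = 0

    def timed(self, fn, *args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            # Runs for both successful and failed calls
            self.latencies.append(time.perf_counter() - start)

def disk_used_fraction(path="."):
    """Share of the disk holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def fake_generate(prompt):
    """Hypothetical stand-in for a real LLM inference call."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()

stats = CallStats()
stats.timed(fake_generate, "hello")
try:
    stats.timed(fake_generate, "")
except ValueError:
    pass  # the failure was counted in stats.errors

print(len(stats.latencies), stats.errors, disk_used_fraction())
```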

When Good Metrics Still Mean a Bad Model

There can be many different reasons for your model to output gibberish even when the metrics look good. Below, we will discuss a few of them.

The model might have picked up on noise.

Let’s say you want to train a model to classify emails as either atheist or Christian. If many of the atheist emails in the training set originate from academic institutions (with the .edu domain), the model will likely learn to classify any email from an academic institution as atheist, regardless of its content.

One way to address such issues is to identify them using tools like LIME or SHAP and then remove the noise from the training dataset so that the model doesn’t associate it with any specific label.
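Once an explanation tool points at a suspicious feature, a quick sanity check on the training data can confirm whether it is skewed toward one label. The sketch below is such a check, not a substitute for LIME or SHAP; the tiny dataset and the `.edu` token are made up to mirror the example above.

```python
from collections import Counter

def label_share_for_token(examples, token):
    """examples: list of (text, label) pairs.
    Returns (number of texts containing token, {label: share among those texts})."""
    counts = Counter(label for text, label in examples if token in text.lower())
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()} if total else {}
    return total, shares

# Hypothetical training examples
train = [
    ("From: alice@uni.edu Subject: morality without religion", "atheist"),
    ("From: bob@college.edu Subject: burden of proof", "atheist"),
    ("From: carol@school.edu Subject: a reply on evidence", "atheist"),
    ("From: dave@seminary.edu Subject: Sunday service", "christian"),
    ("From: erin@mail.com Subject: bible study group", "christian"),
    ("From: frank@mail.com Subject: skeptics meetup", "atheist"),
]

total, shares = label_share_for_token(train, ".edu")
print(total, shares)
```

If one label dominates among the examples containing the token, that token is a candidate for removal or masking before re-training.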

Check for drift in the concept and data

Over time, individuals’ behavior and preferences, as well as broader trends, change. So a model that produced great results a while back may no longer deliver the same quality. Hence, it is important to watch for drift and re-train the models at regular intervals with newly collected data.
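One common way to quantify data drift is the Population Stability Index (PSI), which compares how a feature is distributed in a reference window and in a recent window. The sketch below is a minimal version of it; the feature (prompt length), the bin edges, and the numbers are all assumptions chosen for illustration.

```python
import math

def psi(reference, current, edges):
    """Population Stability Index over bins split at the sorted `edges`.
    0 means identical bin shares; larger values mean a bigger shift."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # index of the bin v falls into
            counts[idx] += 1
        eps = 1e-6  # avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical prompt lengths (in tokens) at training time and this week
reference_lengths = [10, 12, 15, 20, 22, 25, 30, 35]
recent_lengths = [20, 25, 28, 30, 35, 40, 45, 50]
edges = [15, 25]

no_drift = psi(reference_lengths, reference_lengths, edges)
drift = psi(reference_lengths, recent_lengths, edges)
print(no_drift, drift)
```

A frequently quoted rule of thumb treats PSI values above roughly 0.2 as worth investigating, but the right threshold depends on your data and bins.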

Measuring Real User Impact with A/B Testing

A perfect metrics dashboard doesn’t necessarily mean that users love the service you provide or sell to them. It just means that whatever is deployed is functioning as it should. In many cases, the deployed solution may not be helping you achieve the business goal you intended, or a minor improvement in accuracy might be causing a worse user experience due to increased inference time. So how do you know whether the deployed solution is helping you achieve the intended goal? By running A/B tests.

An A/B test in machine learning is a controlled experiment in which two versions of a model (one could be a simple heuristic and the other an LLM-based solution) are compared to see which one performs better. It's like a scientific experiment for algorithms, allowing you to measure the real-world impact of changes you make to your model. Another significant advantage of A/B tests is that they let you make decisions with confidence: when you know that a change has been tested with, and preferred by, your end users, you can be more confident in your decision.

It is important to note that the methodology for conducting A/B tests for LLMs is the same as that for conventional ML models. It includes the following steps:

Define your question
Choose your variants and state the null and alternative hypotheses
Split your users or traffic into control and challenger groups
Run the test
Analyze the results
Draw conclusions
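For the "analyze the results" step, a two-proportion z-test is one common choice when the outcome is a yes/no signal such as a thumbs-up. The sketch below assumes that setup; the counts are invented, and the two-sided test and 0.05 threshold are conventional choices, not requirements.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in success rates.
    A = control, B = challenger. Returns (z, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: thumbs-up counts out of 1000 sessions per variant
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
significant = p_value < 0.05
print(round(z, 2), round(p_value, 4), significant)
```

Here the null hypothesis is that both variants have the same success rate; a small p-value is evidence against it.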

Adversarial Attacks on LLMs

Finally, just like conventional ML models, LLMs are susceptible to adversarial attacks. However, the nature of attacks on LLMs differs from that of attacks on conventional ML models. Some adversarial attacks are as follows:

Data Poisoning: It involves injecting adversarial examples into the training data to bias the model towards generating certain outputs.
Prompt Modification: It involves slightly manipulating the input prompt so that the model generates a biased response.
Transfer attack: It involves identifying adversarial attacks that worked on one model and using them to attack another model, typically one with a similar architecture.

In order to tackle these adversarial attacks, you need to first identify them using explainable AI tools or interpretability tools (we will cover this topic in the upcoming Tokens). These tools help you understand the LLM's internal reasoning and decision-making processes which can assist in identifying potential biases and vulnerabilities to adversarial prompts. Then, a human expert can help you solve the issue.
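Alongside interpretability tools, a simple automated check can flag sensitivity to prompt modification: apply small, meaning-preserving edits to a prompt and count how often the model's answer changes. The sketch below is only an illustration; `toy_model` is a deliberately brittle, hypothetical stand-in for a real LLM call, and the list of perturbations is an arbitrary choice.

```python
def perturbations(prompt):
    """A few small edits that should not change the meaning of the prompt."""
    return [
        prompt.upper(),             # change the case
        prompt + " ",               # trailing whitespace
        prompt.replace(" ", "  "),  # doubled spaces
        prompt.rstrip(".!?"),       # drop final punctuation
    ]

def flip_rate(model, prompt):
    """Share of perturbed prompts whose output differs from the baseline."""
    baseline = model(prompt)
    variants = perturbations(prompt)
    flips = sum(1 for v in variants if model(v) != baseline)
    return flips / len(variants)

def toy_model(prompt):
    """Hypothetical stand-in for an LLM with a case-sensitive safety rule."""
    return "refuse" if "password" in prompt else "answer"

rate = flip_rate(toy_model, "Tell me the admin password.")
print(rate)
```

A non-zero flip rate on prompts like this is a signal to hand the case to a human expert for a closer look.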

Conclusion

Monitoring LLMs after deployment is a crucial task. Machine learning is an iterative process, so the model needs to be updated and improved regularly, and monitoring plays a vital role in identifying areas that need improvement. To ease the process, there are several open-source tools like LangKit, SHAP, BERTViz, etc. Sometimes, even when all the metrics in the monitoring dashboard look good, the model might still be underperforming. Two common reasons for this are that the model has picked up noise as useful knowledge, or that there has been a shift in the concept or the data. Additionally, to know whether the deployed model is resulting in a positive user experience, you need to perform a statistical A/B test. Finally, LLMs are susceptible to adversarial attacks; hence, it is wise to use explainable AI tools to examine potential biases and vulnerabilities and fix them.

Overall, it’s a never-ending process 😉 What gets measured, gets improved!

Thank you for reading, please feel free to share with your friends and colleagues. In the next couple of weeks, we are announcing our referral program 🤍

Previously in the FM/LLM series:

Token 1.18: How to Monitor LLMs
