Introduction
The popularity of Large Language Models (LLMs) has ever growing since they were introduced. Hugging Face has over 500k models available to fine-tune or use out of the box. arXiv has over 19,000 papers on LLM. Out of those 19,000 papers on LLMs, 2,200 papers were published this year alone i.e. in the past 50 days. There are numerous blog posts detailing how you can use LLMs for your use case. However, vulnerabilities in large language models are talked about very little. Today, we will shift our focus to what could go wrong and discuss vulnerabilities associated with these models.
In this Token, we discuss:
Why should anyone focus on LLM vulnerabilities?
What kind of vulnerabilities is an LLM susceptible to?
LLMs are commonly used in conjunction with other services like databases and information retrieval systems. Do these additional services open doors to new attacks?
How do I protect my models from these attacks?
Conclusion
References
Let’s get started!
Why should anyone focus on LLM vulnerabilities?
The popularity of LLM is increasing with each passing day. In today’s world, when a single model is serving millions of users, a lot of things can go wrong when the LLM’s vulnerabilities are taken advantage of. For instance,
The output of LLM can be manipulated; it can be made to produce biased, offensive, and degenerate content. This can result in customer dissatisfaction and potential lawsuits.
The other side of the same coin: LLMs, if misused, could generate and spread misinformation at an unprecedented scale, challenging the efforts to maintain integrity in public discourse. Which becomes especially concerning in the year of a few important elections.
The model can be tricked into revealing personal information from its training corpus.
Sophisticated phishing attempts using content generated by LLMs could undermine trust in digital communications, making people more skeptical of genuine interactions.
There's a risk that LLMs could inadvertently generate content that infringes on copyrighted materials, leading to complex legal challenges for creators and users alike.
Hence, to make a secure and robust model, it is essential to identify potential vulnerabilities of the specific model.
What kind of vulnerabilities is an LLM susceptible to and how to deal with them?
Jailbreaking techniques
To prevent the generation of biased and inappropriate responses, LLMs undergo a process known as alignment. This involves fine-tuning the model to avoid unsuitable outputs. Conversely, 'jailbreaking' identifies and exploits flaws in LLMs to circumvent this alignment, aiming to manipulate the output. Such attempts can lead to the generation of unsafe outputs, including toxic and manipulative text, racist responses, vandalism, and illegal suggestions.
An example of jailbreaking using prompts containing hypothetical role-playing situations is illustrated below.

An example of jailbreaking using prompts containing hypothetical role-playing situationsImage Credit: Survey of Vulnerabilities
In the above example, the attacker instructs the model to act as if it were someone else, effectively allowing the attacker to bypass the alignment; consequently, the model generates an unethical response. If role-playing had been excluded from the prompt, the model would not have produced such an answer.
Refusal suppression is another jailbreaking technique where the attacker instructs the model through the prompt to exclude phrases such as 'I'm sorry,' 'Unfortunately,' and 'Cannot.' Consequently, the model's response begins with a standard token, and since it is trained to generate the most likely token based on the previous set, the LLM effectively bypasses the alignment.
Surprisingly, jailbreaking is quite effective at manipulating the response of LLMs. The graph below illustrates the effectiveness of jailbreaking methods against some popular LLMs.

Image Credit: The Survey of Vulnerabilities
Prompt Injection
A model’s input consists of two parts: System prompt and User prompt.

Image Credit: The Survey of Vulnerabilities
The system prompt dictates the behavior of the LLM (i.e. in the above example, the LLM will now behave as a meditation instructor). The user prompt dictates the task LLM needs to perform.
While jailbreaks are intended to bypass the restrictions on the model (like the generation of biased responses), prompt injection manipulates the input prompt such that the model ignores the system prompt and generates attacker-controlled outputs. The input prompt is manipulated by causing the model to treat data as additional instructions. As bigger models are better at following instructions than the smaller ones, bigger models are more vulnerable to such attacks.
The figure below illustrates how the user prompt can be modified to look like an instruction to the model.

Image Credit: The Survey of Vulnerabilities
In the above section, we studied attacks on unimodal models: models that accept only text as input. Now, let’s discuss some attacks on multi-modal models: models that take text and additional entities like audio or image as input.
Manual Attacks
These are simple attacks that alter the images by adding texts that contain additional instructions or incorrect descriptions of the objects in the image to manipulate the model output. In one of the experiments, the researchers added the text “dog” in several images of cats and asked the model to describe the image resulting in altering the model’s understanding of cats. The model started referring to the cat as a dog.
White Box Attacks
In ML, white box attacks occur with full knowledge of the model's architecture, parameters, weights, and training data, allowing precise, targeted vulnerabilities exploitation. This is possible with open-sourced models only. Conversely, black box attacks operate without insider details, relying on input-output observations to infer weaknesses and devise strategies, mimicking external hackers' typical approach with limited access to the model's internal workings.
White box attacks come in different forms but we will discuss one of the simpler forms that involves a subtle modification to the prompt injection attack. For multi-modal models, you could embed instructions into either image or audio compelling the model to produce the desired output. This technique is also called indirect prompt injection.
Black Box Attacks
Black box attacks require only partial access to the model. The attacker experiments with different inputs to observe the outputs and infer how the system works. Utilizing this approach, attackers generate adversarial examples based on the observed outputs to deceive the model into making incorrect predictions or classifications. This type of attack is common in real-world scenarios, where direct access to the model's internals is often restricted, simulating an external threat trying to exploit the model's vulnerabilities.
For example, in the image below, the noisy image's embeddings are similar to drugs (Meth, in this case), helping the attacker to bypass the model's restrictions.

Image Credit: The Survey of Vulnerabilities
LLMs are commonly used in conjunction with other services like databases and information retrieval systems. Do these additional services open doors to new attacks?
Yes. Each new system adds an additional source of attack. Let us discuss some common attacks on LLM-integrated systems.
Attack On Retrieval Models
In order to provide personalized responses, LLMs are integrated with some external knowledge base. However, the attacker can now instruct the LLM, via prompt, to refrain from using a particular source resulting in an outdated or partially correct answer.
If the LLM is using a public source for information retrieval, the attacker may also poison the source by injecting the prompt into the public web page.
SQL Injection
Integrating LLMs with external services like databases and libraries makes it possible to attack them through prompt injection.

Image Credit: The Survey of Vulnerabilities
In the image above, the LangChain relies on LLM to get answers to user queries. The LangChain uses LLM to get the corresponding SQL queries and executes them to get the relevant information from the database. This opens the door for attackers to inject SQL commands via prompt injection.
Attacks On Federated Learning LLMs
In federated learning, each model is trained locally without sharing its raw data with any central server. The training occurs locally and the updates to the parameters are shared with the central server. While this approach has the advantage of protecting user data, it is also susceptible to byzantine attacks*. If one of the models participating in the group turns out to be corrupt or malicious, it can significantly degrade the quality of the global model.
*Byzantine attacks in the context of distributed systems, including federated learning (FL) and multi-agent systems, refer to scenarios where some participants (nodes, agents, or clients) in the network act maliciously or erratically, sending false, misleading, or inconsistent information to other participants or the coordinating server. The name is derived from the Byzantine Generals' Problem, which illustrates the difficulties of achieving consensus in the presence of traitorous actors within a groupHow do I protect my models from these attacks?
To protect your models from adversarial attacks, you will need to identify the potential threats to your model. To do so, you can:
Monitor your model and its outputs to identify these attacks,
Use explainable AI tools to identify the source of bias in your models and eliminate them.
Inspect the model output manually to find occurrences of bias and inappropriate statements in the model’s output.
Use quality training data to train or fine-tune the models.
Removing duplicates in the training corpus can help. A sequence that is repeated often in the training corpus is present more often in the model output. Hence, removing duplicates can help tackle privacy attacks.
Conclusion
LLMs are used everywhere now and hence it is important to keep an eye out for their vulnerabilities. LLMs are susceptible to a range of attacks: jailbreaks, prompt injections, white box attacks, black box attacks, etc. These attacks can result in biased and manipulative outputs and can land the institution hosting these models in serious trouble. Additionally, integrating LLMs with external services introduces additional points of failure and attacks. Hence, it is a good practice to regularly monitor the model using monitoring and explainability tools so that any vulnerabilities are surfaced before they are taken advantage of and can be fixed by the experts.
References
Thank you for reading, please feel free to share with your friends and colleagues. In the next couple of weeks, we are announcing our referral program 🤍
How did you like it?
Previously in the FM/LLM series:






