How Do We Stop Hackers From Jailbreaking LLMs in 2024?

Why Trust Techopedia

Everyone knows GenAI’s early iterations have had a penchant for odd outputs (hallucinations, anyone?). Developers are working hard to fix the flaws, but what happens if hackers find their way in? 

It’s not hard to imagine large language models (LLMs) being hijacked to skew financial advice, impact stock prices, spread misinformation during election campaigns, or damage brands by spitting out racist text and imagery. 

Can we stop #hackedai from becoming a thing? We asked the experts where the biggest threats reside and what developers and enterprise users can do to avoid them.

Key Takeaways

  • Generative AI is tech’s top focus at the moment, placing LLMs firmly in cybercrime’s crosshairs.
  • Alongside traditional threat actors, a new cohort of low-code hackers is trying to exploit LLMs using cleverly-tweaked prompts.
  • The immediate threats are fraud, copyright, trademark infringement, and objectionable content.
  • However, as AI tools extend their footprint into finance, government, and politics, the threat could be societal.

Protecting LLM Safety: ‘An Endless Dance of Deploying Defenses’

As exceptional as they are, the large language models (LLMs) that power generative AI are still just software with vulnerabilities and exploits like any other. As they become more popular, they become bigger targets for bad actors, part of the “endless dance of deploying defenses only to be hijacked by a more brilliant attacker a few months later,” wrote AI expert Sahar Mor in a recent blog. “Hacking LLMs and language-powered applications is no different.”


Dane Sherrets, Senior Solutions Architect at HackerOn, told Techopedia that the threat landscape for LLMs is vast and diverse, “with fraudsters, state actors, and organized crime networks all benefiting from GenAI’s ability to conduct attacks at scale.”

But GenAI may be its own worst cyber enemy in the sense that it has democratized hacking. Sherrets adds that using the latest LLM attack vectors means “cybercriminals with limited technical knowledge can commit attacks they would otherwise struggle to execute.”

The Biggest Threats

How worried should we be? AI hyperbole is everywhere, and it’s easy to drift into doomerism, but there are legitimate concerns. The most common include copyright violations, retail fraud (e.g. hacking a chatbot to confirm an ineligible product return), or tricking an LLM into saying something objectionable. Look deeper, however, and there’s potential for compromised LLMs to have a societal impact.

David Haber, CEO of Lakera, told Techopedia that the impact of basic attacks could be magnified by the emergence of an ‘Internet of Agents’ or IoA, a deeply integrated network of AI-to-AI applications.

In an IoA, “AI agents interact directly with each other to complete tasks and transactions and create complex outputs, independent of human intervention or guidance,” Haber says. They might be tasked with managing stock portfolios or executing creative marketing campaigns, which gives them a wide remit. For efficiency’s sake, they may bypass standards and laws governing these areas.

If an IoA is compromised, he adds, it could potentially “publish misinformation to influence voters or collude to drive a stock price up or down. The limits of what they can or will do are still unknown, and the difference between individual or organizational risk and systemic risk gets slimmer by the day.”

John Engates, Field Chief Technology Officer at Cloudflare, told Techopedia that one of the main goals of any LLM threat actor is to erode trust.

Deep fakes are one of the most useful tools to achieve this. While they’ve been around for years, today’s deep fakes are more realistic than ever,” he adds. “Even trained eyes and ears are failing to identify them.” 

More widespread use of artificial intelligence-aided tools such as AI-optimized DDoS attacks is another worry as the next US election ramps up.

Dane Sherrets says he’s conducted tests in a controlled research setting that leveraged GenAI to generate content that could be used in misinformation campaigns, including “accurate voice cloning just from a few seconds of an audio recording.

“These scalable forms of misinformation will continue to become more prevalent and harder to detect.”

LLM Attack Vectors: Jailbreaking, Prompt Injections, and More

Just as generative AI has given consumers easy access to intensive computing capabilities, it’s also consumerized cybercrime by creating hacks that don’t require specialist knowledge of code or other IT-specific skills. These include:

Prompt Injection

Prompt injection attacks bolt underhanded instructions onto an otherwise innocent prompt. The aim is to shape the model’s output for malicious purposes. 

First noted in 2022, researchers found they could hijack OpenAI’s GPT-3 model by crafting prompts with additional instructions, context, or hints. 

These ‘tricked’ the LLM into generating outputs that were biased, incorrect, unexpected, or offensive — despite the model being specifically programmed against them.


Jailbreaking takes the prompt injection technique and applies it to chatbots based on LLMs, like ChatGPT or Google’s Gemini (formerly Bard). A prompt is created with instructions designed to neutralize LLM safety and moderation features or restrictions set by a device’s operating system. 

Because much of the data in chatbot LLMs comes from interactions with humans, jailbreaking borrows techniques used in social engineering. LLM developers are building tools to combat known jailbreaking techniques, but attackers continue to invent or uncover new ways in.

Training Data Poisoning

Bad actors can also attack LLMs by ‘poisoning’ the data used to train them and corrupt the machine learning process. This could involve someone with access inserting manipulated or incorrect data into the model’s training dataset to alter its behavior and shape its outputs. 

It could also happen by simply publishing misleading information in enough places on the internet and waiting for the LLM to absorb and process it. Imagine the impact on future criminal proceedings if, for example, the names and other identifying information in a facial recognition dataset were changed, directing the system to identify faces incorrectly.

The barrier to entry for attacks of this nature is fairly low, opening the door to a new era of low or no-code cybercrime. Meanwhile, traditional bad actors are also hard at work devising ways to compromise generative AI systems.

Lakera’s Haber says LLM supply chain attacks are on the rise too, where bad actors “target and compromise third-party libraries, dependencies, and development tools used in the development of LLM-powered applications.”

He also notes the recent discovery of Crescendo attacks, which subtly bypass LLM safety measures by starting with benign interactions and then slowly escalating in order to avoid triggering defenses and make the LLM deliver a prohibited output. 

Techniques to Protect LLMs

In a recent blog, former Stripe product lead Sahar Mor advises GenAI developers to be on the lookout for LLM responses that contain system messages (in whole or in part). This can be done by creating ‘canary’ words — unique, randomly generated text that wouldn’t normally appear in an LLM response.

Another tactic, he writes, is to limit the length of user inputs. An input of more than 1,000 words to a chatbot conversation, for example, should set off alarm bells. Strict controls on LLM access to your backend systems can also limit the extant damage a hacker could inflict if they did manage to gain control of a GenAI application.

Six Prompting Methods to Boost LLMs by Sahar Mor.
Six Prompting Methods to Boost LLMs by Sahar Mor. Source: AI Tidbits

How to Protect Business Users of GenAI

Adding any new application to an enterprise network implies vulnerabilities and unforeseen complications. What can businesses do to embrace the potential of GenAI while keeping users and company data safe?

  1. Firm-up Access Controls

    Strict access controls pop-up again as a vital mechanism for protecting business users. David Haber says that limiting who in the organization can access an enterprise GenAI platform and its data is a critical precaution.

    Role-based access controls (RBAC) can help with this by assigning permissions based on users’ roles and responsibilities.” Any enterprise version of these tools should offer robust data security measures, he adds, “including encryption of data both in transit and at rest.

    “(You should) also verify that the service provider has appropriate security certifications and compliance with relevant data protection regulations.”

  2. Lock Down User Credentials

    Balaji Ganesan, CEO and co-founder of Privacera, told Techopedia that compromised user credentials are a potentially catastrophic risk where LLMs are concerned.

    “This issue becomes particularly alarming when user permissions are extensive and permissive, granting users access to resources beyond their essential requirements.”

    He says enterprises also need to find a way to work with developers like Microsoft and OpenAI “and augment (their consumer facing tools) with network-based proxy scanning to detect what is going in and out.”

  3. Choose Wisely

    HackerOne’s Dane Sherrets highlights that any enterprise considering the adoption of an LLM-powered solution should exercise rigorous due diligence when selecting a vendor. He said:

    “It’s essential that GenerativeAI providers adhere to stringent security standards and offer transparent mechanisms for addressing vulnerabilities and security incidents.


    “Engaging in thorough risk assessments and establishing clear communication channels with vendors early in the relationship can help mitigate potential security risks associated with third-party LLM implementations.”

The Bottom Line

Privacera’s Ganesan says generative AI has an ‘inherent problem’ when it comes to cybersecurity:

“It was initially designed to create. It is a storyteller—an image creator. So, it is very easy to exploit the very nature of the technology.”

By opening up the field to attacks by technically unsophisticated actors, GenAI’s early iterations may have sown the seeds of their own compromise. That’s not to say the end is nigh. AI and cybersecurity teams are working to address the vulnerabilities, as evidenced by the OWASP Top 10 For Large Language Models (LLMs) — an expert resource for understanding and addressing security concerns associated with LLM deployments.

And continual vigilance remains the cyber reality in every software category. Lakera’s David Haber notes that the “motivations and capabilities of threat actors evolve over time, making it challenging to pinpoint a single source of the biggest threat.

“What we do know is that whoever they are, they’re hard at work to find exploits and vulnerabilities and it’s only a matter of time before they find more.”


What is LLM prompt injection?

What is jailbreak in LLMs?

What is the difference between prompt injection and jailbreak?



  1. Harnessing research-backed prompting techniques for enhanced LLM performance (Aitidbits)
  2. Dane Sherrets (@DaneSherrets) / X (Twitter)
  3. Lakera (Lakera)
  4. Jen Gates (Twitter)
  5. Riley Goodside on X: “Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. (T)
  6. Riley Goodside on X: “Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. (Twitter)
  7. API to Prevent Prompt Injection & Jailbreaks – Community – OpenAI Developer Forum (Community.openai)
  8. Anthropic on X: “Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM’s safety (T)
  9. Anthropic on X: “Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM’s safety training (Twitter)
  10. Del Complex (T)
  11. Del Complex on X: “Introducing, VonGoom: A method for data poisoning large language models to introduce bias, requiring as few as 100 poisoned examples within training data. Deployed in January, we have penetrated dozens of commonly scraped websites with poison examples. (T)
  12. Del Complex on X: “Introducing, VonGoom: A method for data poisoning large language models to introduce bias, requiring as few as 100 poisoned examples within training data. Deployed in January, we have penetrated dozens of commonly scraped websites with poison examples. (Twitter)
  13. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Arxiv)
  14. OWASP Top 10 for Large Language Model Applications | OWASP Foundation (Owasp)

Related Reading

Related Terms

Mark De Wolf
Tech Writer
Mark De Wolf
Tech Writer

Mark is a freelance tech journalist covering software, cybersecurity, and SaaS. His work has appeared in Dow Jones, The Telegraph, SC Magazine, Strategy, InfoWorld, Redshift, and The Startup. He graduated from the Ryerson University School of Journalism with honors where he studied under senior reporters from The New York Times, BBC, and Toronto Star, and paid his way through uni as a jobbing advertising copywriter. In addition, Mark has been an external communications advisor for tech startups and scale-ups, supporting them from launch to successful exit. Success stories include SignRequest (acquired by Box), Zeigo (acquired by Schneider Electric), Prevero (acquired…