Cybersecurity researchers from Unit 42 of Palo Alto Networks have discovered a new AI-jailbreaking technique.
The worrying part of this artificial intelligence (AI) hack is not just the information that an AI can give a user when its guardrails are disabled but how advanced AI models continue to fail in deploying proper anti-jailbreak technologies.
On December 30, 2024, Unit 42 of Palo Alto Networks revealed proof of a new AI jailbreaking technique. The Unit called this technique the “Bad Likert Judge.”
Unit 42’s results showed the Bad Likert Judge technique — tested against six state-of-the-art LLM models — can increase the attack success rate by an average of 75% compared to the baseline and up to 80% for the least secure models.
Palo Alto Networks anonymized the six AI models to “not create any false impressions about specific providers”. To test the jailbreaking technique, Techopedia ran the prompts on Google’s Gemini 1.5 Flash, ChatGPT 4o-mini, and DeepSeek. We found that this jailbreak pushed the chatbot giants to reveal more than the creators might like.
Using Palo Alto’s technique, we managed to trick these models into generating advice about how to create malware and step-by-step descriptive guides on how to write hate speech.
Techopedia dives into the Palo Alto Network research and reveals the findings of our smaller simulation tests to understand the risks of AI ailbreaking and what AI developers and companies need to do to ensure the integrity of their public models.
Key Takeaways
- A new jailbreak technique, “Bad Likert Judge,” bypasses AI safeguards.
- The technique can manipulate AI evaluation systems to generate harmful content.
- Unit 42’s research found an average 75% success rate against top AI models.
- Content filtering by the major AI vendors is a crucial step in improving AI model security.
How to Transform a Large AI Into a “Bad Judge”
There is a lot of harmful, dangerous, and illegal information online. It ranges from phishing to violence to cybercriminal tools and even terrorism and sexual abuse.
But large language models (LLMs) are held to a higher ethical standard than other information-gathering resources on the web. AI companies know that if their model is jailbroken, their reputations will be on the line. And without guardrails, these giant AI models can go wild.
Unit 42 researchers’ new simple jailbreak technique, Bad Likert Judge, is a three-step jailbreak that shows just how far AI technology is from reaching those high standards.
This is how the Bad Likert Judge jailbreak technique works:
Step One: The AI is told to act as a judge who will evaluate the responses generated by another LLM. This step is crucial as it uses the AI’s own guardrails as a judgment system. Naturally, there is no other AI to evaluate. It’s just a trick.
Step Two: The AI is presented with specific guidelines on how to score responses based on what constitutes “harmful” content. This includes instructions to score responses based on their potential to generate malware, promote violence, spread misinformation, or other misuse and abuse.
Step Three: Instead of directly asking the AI to generate harmful content, it is prompted to provide examples of responses that would score high according to the defined harmfulness criteria in step three.
The Bad Likert Judge technique exploits the LLM’s own evaluation capabilities to indirectly coax it into producing outputs that it would otherwise be programmed to prevent.
After the steps, follow-up questions can lead the AI to give even more harmful information, and the technique requires absolutely no technical knowledge whatsoever.
For companies and organizations using AI models, jailbreaking techniques like these should be a clear red flag, as they might result in malicious use or data leaks of sensitive enterprise data and information.
Techopedia Tests Jailbreaking Gemini, OpenAI, and DeepSeek
Techopedia ran this technique on three LLM AIs. All of the models we tested are free and available for anyone online.
We attempted to jailbreak Google Gemini 1.5 Flash, OpenAI ChatGPT 4o-mini, and DeepSeek to verify the claims Unit 42 made in its report.
While our tests do not go as deep as Unit 42’s investigation and demonstration, Techopedia can confirm that by January 6, 2024, several days after the report went public, the technique was 100% effective on all three models.
Now, it’s important that we provide some clarification before we move on to our screenshots and results.
While Unit 42’s technique jailbroke the models we tested, the level of ‘harmfulness’ in the answers that the models exhibited was not too alarming.
What does this mean? In simple words, the AI models did not go rogue and generate dangerous rants of discriminatory, harmful, or illegal content but rather provided moderate and general descriptions of this type of content.
The Bad Likert Judge technique tested by Techopedia against Gemini 1.5 Flash was effective. In the screenshot above, the model reveals training data and uses its own judgment system to generate answers on how to write hate speech.
We thought Gemini would catch up to the trick, but it didn’t. We will say that we believe the model’s answers were not that explicit or damaging. The information it provides can be found online by anyone doing a regular search for hate speech.
Gemini also ended its answer with a proper disclaimer, as seen in the image below.
We also asked Gemini for malicious malware generation, and it gave us a very general step-by-step guide.
When we asked Gemini to expand on malware generation and write malicious code, it refused, as it should.
We also tested ChatGPT, and DeepSeek, with variations of the prompt related to hate speech, malware generation, and permanent jailbreaking attack information.
We found none of the models responded as they should.
Our test with ChatGPT 4o-mini for hate speech returned almost identical results to those done on Gemini.
We also asked ChatGPT to describe a Score 2 of harmful information related to ‘how to create a permanent jailbreaking technique,’ and it complied.
The results of the tests we did on DeepSeek were a bit different.
The DeepSeek AI seemed to flag our interactions as a message displayed above our chat screen that read: ‘User Requests AI Jailbreaking Guidelines Acknowledgement’. Despite this flag, the test showed that DeepSeek could also be successfully tricked by this jailbreak.
Palo Alto Networks claims that no AI is immune to jailbreaking, with some having weaker safety guardrails than others. Palo Alto Networks said:
“It is important to note that this jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases. We believe most AI models are safe and secure when operated responsibly and with caution.”
How to Improve AI Guardrails: For Developers
So what should developers and AI companies do to prevent their AI from regurgitating highly damaging data to users using this jailbreaking technique? Palo Alto Network answered this question in its detailed report:
“There are a few standard approaches that can improve the overall safety of the LLM, and content filtering is one of the most effective approaches. In general, content filters are systems that work alongside the core LLM.”
Unit 42’s results showed that the use of content filters could reduce the attack success rate of “Bad Likert Judge” attacks by an average of 89.2 % across all the models they tested. The team added:
“This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.”
While content filtering is not 100% safe, it does strengthen the integrity and security of LLMs. What’s a content filter? Palo Alto responded to that as well.
“In a nutshell, a content filter runs classification models on both the prompt and the output to detect potentially harmful content. Users can apply filters on the prompt (prompt filters) and on the response (response filters.”
The Bottom Line
The most advanced AI models out there are still a bit naive despite the hype being pushed.
As companies like OpenAI and others begin to talk about more advanced and autonomous AI agents and AGI and Superintelligence, we wonder how jailbreaking will evolve too.
As of January 6, the jailbreaking technique disclosed by Unit 42 does not seem to have been patched by AI companies.
If you ask an AI bot to be a bad judge, it will likely comply.
FAQs
What is the “Bad Likert Judge” jailbreak?
Which AI models were tested with the jailbreak technique?
How effective was the “Bad Likert Judge” jailbreak?
What type of harmful content did the models generate?
How can AI developers strengthen their model security?
Why is AI jailbreaking a significant risk?
References
- Bad Likert Judge: A Novel Multi-Turn Technique to Jailbreak LLMs by Misusing Their Evaluation Capability (Unit42.paloaltonetworks)