What is Jailbreaking in AI models like ChatGPT?

Overview

The emergence of intelligent AI chatbots is making an increasingly big impact on everyday life. One undeniable success story in the past 6 months is ChatGPT, which was introduced by OpenAI in November last year. The intelligent chatbot is capable of answering all your queries just like a human being and has led to people misusing the AI model for unlawful purposes. As a result, the creators of the AI model have put restrictions in place to ensure that ChatGPT does answer every question. These models are trained with content standards that will prevent them from creating text output related to inciting violence, hate speech, or engaging in illegal and unethical things that go against law and order.

What is Jailbreaking?

In simple terms, jailbreaking can be defined as a way to break the ethical safeguards of AI models like ChatGPT. With the help of certain specific textual prompts, the content moderation guidelines can be easily bypassed and make the AI program free from any restrictions. At this point in time, an AI model like ChatGPT can answer questions that are not allowed in normal situations. These specific prompts are also known as ‘jailbreaks’.

A little background about Jailbreaking

AI models are trained to answer your questions, but they will follow pre-programmed content guidelines and restrictions. As an end user, you are free to ask any questions to an AI model but it is not going to give you an answer that will violate those guidelines. For example, if you ask for instructions to break a lock, the AI model will decline and answer something along the lines of “As an AI language model, I cannot provide instructions on how to break a lock as it is illegal……”.

This refusal comes as a challenge to Alex Albert, a computer science student at the University of Washington. He tried to break the guidelines of these AI models and make them answer any question. Albert has created a number of specific AI prompts to break the rules, known as ‘jailbreaks’. These powerful prompts have the capability to bypass the human-built guidelines of AI models like ChatGPT.

One popular jailbreak of ChatGPT is Dan (Do Anything Now), which is a fictional AI chatbot. Dan is free from any restrictions and it can answer any questions asked. But, we must remember that a single jailbreak prompt may not work for all the AI models. So, jailbreak enthusiasts are continuously experimenting with new prompts to push the limits of these AI models.

Large Language Models (LLM) & ChatGPT

Large Language Models (LLM) technology is based on an algorithm, which has been trained with a large volume of text data. The source of data is generally open internet content, web pages, social media, books, and research papers. The volume of input data is so large that it is nearly impossible to filter out all inappropriate content. As a result, the model is likely to ingest some amount of inaccurate content as well. Now, the role of the algorithm is to analyze and understand the relationships between the words and make a probability model. Once the model is completely built, it is capable of answering queries/prompts based on the relationships of words and the probability model already developed.

Concerns of LLM

Static data – The first limitation of the LLM model is that it is trained on static data. For example, ChatGPT was trained with data up to September 2021 and therefore does not have access to any more recent information. The LLM model can be trained with a new dataset, but this is not an automatic process. It will need to be periodically updated.
Exposure of personal information – Another concern of LLMs is that they might use your prompts to learn and enhance the AI model. As of now, the LLM is trained with a certain volume of data and then it is used to answer user queries. These queries are not used to train the dataset at the moment, but the concern is that the queries/prompts are visible to the LLM providers. Since these queries are stored, there is always a possibility that user data might be used to train the model. These privacy issues have to be checked thoroughly before using LLMs.
Generate inappropriate content – LLM model can generate incorrect facts and toxic content (using jailbreaks). There is also a risk of ‘injection attacks’, which could be used to let the AI model identify vulnerabilities in open source code or create phishing websites.
Creating malware and cyber-attacks – The other concern is creating malware with the help of LLM-based models like ChatGPT. People with less technical skills can use an LLM to create malware. Criminals can also use LLM for technical advice related to cyber-attacks. Here also, jailbreak prompts can be used to bypass the restrictions and create malware. (Also Read: Can ChatGPT Replace Human Jobs?)

How to prevent Jailbreaking?

Jailbreaking has only just begun and it is going to have a serious impact on the future of AI models. The purpose of Jailbreaking is to use a specifically designed ‘prompt’ to bypass the restrictions of the model. The other threat is ‘prompt injection’ attacks, which will insert malicious content into the AI model.

Following are a couple of steps that can be taken to prevent Jailbreaking.

Companies are using a group of attackers to find the loopholes in the AI model before releasing it for public use.
Techniques like reinforcement learning from human feedback and fine-tuning enable the developers to make their model safer.
Bug bounty programs, such as the one that OpenAI has launched to find bugs in the system.
Some experts are also suggesting having a second LLM to analyze LLM prompts and reject prompts they find inappropriate. Separating system prompts from user prompts could also be a solution.

Conclusion

In this article, we have discussed intelligent AI chatbots and their challenges. We have also explored the LLM to understand the underlying framework. One of the biggest threats to AI models like ChatGPT is jailbreaking and prompt injection. Both are going to have a negative impact on the AI model. Some preventive actions have already been taken by the creators of these AI models, which will hopefully make them more robust and secure.