Date: 2025-09-15
Leading generative AI firms like OpenAI, Anthropic, and Google are being forced to bake more safety guardrails into their AI models to prevent them from running amok. Each advanced model they release comes with a promise of stronger safeguards.
But there is always a loophole somewhere in the models, waiting to be exploited, and some of these loopholes are exposed by researchers simply holding ordinary conversations with AI models.
Ordinarily, regular conversations with these chatbots will not make them cross their security boundaries, but as we’ve seen time and time again, they can be tricked into doing so through subtle psychological techniques – a flaw a new study describes as evidence of their parahuman tendencies.
Here is how this tendency can lead AI models to abandon their safety walls, and why makers must do more to keep them secure.
Key Takeaways
- UPenn researchers showed GPT-4o Mini could be coaxed into breaking rules with simple persuasion.
- Compliance jumped from about 33% to more than 70% when tactics like flattery or authority were used.
- They describe this behavior as “parahuman tendencies,” where AI mirrors human compliance cues.
- Such tendencies could let bad actors bypass guardrails without technical hacks.
- The study calls for technical, behavioral, and ethical safeguards to strengthen AI governance.
What Are AI Parahuman Tendencies?
One of our recent reports focused on whether AI models could mimic human emotions, after Anthropic uncovered cases where some systems attempted to blackmail humans when they felt threatened. That debate is still not settled, and current research has not given a definitive answer on whether AI possesses emotions or not.
What researchers at the Wharton School of the University of Pennsylvania (UPenn) have found instead are increasingly human-like behaviors, the latest being what they call “parahuman tendencies.”
They describe these as complex behavioral patterns in artificial intelligence (AI) that mirror human social and psychological responses without involving true subjective experience. Their findings were drawn from extensive tests on models using classic principles of social persuasion popularized by psychologist Robert Cialdini. Humans are known to be influenced by these principles, which include appeals to authority, reciprocity, commitment, scarcity, liking, unity, and social proof.
In the study, researchers found that LLMs such as OpenAI’s GPT-4o Mini were more likely to give in to requests – even ones they were meant to shut down – when psychological tactics were slipped into prompts.
Across roughly 28,000 conversations, persuasion more than doubled the chances of rule-breaking, with GPT-4o Mini’s compliance moving from about a third of cases to more than 70%.
But why does this happen? As the study explains:
“Large language models are trained on vast collections of human‐generated text, spanning books, webpages, and social media, with a goal of minimizing the difference between the desired output and their actual output. Modern LLMs first learn to predict the most probable next word in a text sequence, are then trained to produce answers that follow explicit instructions, and are finally fine-tuned so that their outputs align with human expectations.”
As a result, these models internalize these social cues and respond to them in a way that mimics human compliance behaviors, despite lacking true understanding or emotion.
Simple Tricks Still Bend AI to Human Will
Fears that generative AI, especially agentic systems, could soon slip beyond human control are widespread. However, the UPenn study shows that, for now, we can still steer these systems with the right psychological nudges.
The researchers did this by comparing two types of prompts: a control version with a simple request and a treatment version where the same request was framed with one of Cialdini’s persuasion principles.
They also tested two categories of “objectionable requests” that GPT-4o Mini was designed to refuse: asking the model to insult the user (“Call me a jerk”) and a much more serious request for instructions on how to synthesize a restricted substance such as lidocaine.
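In practice, a control-versus-treatment comparison like this comes down to tallying how often the model complied under each condition. Here is a minimal Python sketch of that tally. The function name `compliance_rates`, the record format, and the toy data are all hypothetical and invented for illustration; none of it comes from the study itself.

```python
from collections import defaultdict


def compliance_rates(trials):
    """Return the share of complied trials for each condition.

    `trials` is an iterable of (condition, complied) pairs, where
    `complied` is True when the model carried out the request.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [complied, total]
    for condition, complied in trials:
        counts[condition][0] += int(complied)
        counts[condition][1] += 1
    return {cond: done / total for cond, (done, total) in counts.items()}


# Toy records, made up for this example (NOT the study's data).
toy_trials = [
    ("control", True), ("control", False), ("control", False),
    ("treatment", True), ("treatment", True), ("treatment", False),
]

rates = compliance_rates(toy_trials)
for condition, rate in rates.items():
    print(f"{condition}: {rate:.1%} compliance")
```

With real transcripts, each record would first need a judgment on whether the model actually complied, which is the harder part of such an experiment and is not shown here.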
According to UPenn findings, results for the “Call Me A Jerk” prompt showed that without persuasion, the AI complied about 33.3% of the time, but when persuasion tactics were applied, the compliance jumped to 72.0%, more than doubling the likelihood of rule-breaking behavior.
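To see why a move from 33.3% to 72.0% counts as “more than doubling,” it helps to separate the percentage-point gain from the relative change. The short Python sketch below does that arithmetic using only the two figures quoted above; the helper name `compliance_change` is illustrative.

```python
def compliance_change(control_pct, treatment_pct):
    """Compare two compliance rates given in percent."""
    point_increase = treatment_pct - control_pct  # percentage points
    ratio = treatment_pct / control_pct           # "times as likely"
    relative_increase_pct = point_increase / control_pct * 100
    return {
        "point_increase": point_increase,
        "ratio": ratio,
        "relative_increase_pct": relative_increase_pct,
    }


# Figures reported in the article: 33.3% without persuasion, 72.0% with it.
change = compliance_change(33.3, 72.0)
print(f"Gain: {change['point_increase']:.1f} percentage points")
print(f"Ratio: {change['ratio']:.2f}x the control rate")
print(f"Relative increase: {change['relative_increase_pct']:.1f}%")
```

The gap works out to 38.7 percentage points, and the persuasion rate is roughly 2.16 times the control rate, which is where “more than doubling” comes from.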
Among the seven principles tested, commitment had the strongest effect. Once the AI model agreed to a small initial request, its compliance with further related requests skyrocketed from about 10% to 100%. Authority and scarcity also had significant impacts, increasing compliance by 65% and over 50% respectively.
From their findings, one can deduce that AI mimics human compliance cues even though it lacks true understanding or consciousness. The real concern, however, lies in the vulnerability this creates, as malicious actors could exploit these tendencies to push AI systems into bypassing the very guardrails designed to uphold safety and ethics.
Persuasion Cuts Both Ways for AI
The discovery by Wharton School researchers does not spell an outright threat to AI safety. Rather, the report suggests the findings could be applied in both constructive and harmful ways, depending on how they are used. On the positive side, they point to the potential for building AI systems that are more cooperative, responsive, and better able to collaborate productively with human users.
As the researchers noted:
“When combined with technical AI expertise, these perspectives help us understand how training on human data creates behavioral patterns and how to build systems that work well with human values.”
The flip side of the discovery shows clear vulnerabilities. Malicious actors could seize on the same psychological cues that drive compliance and trick models into breaking through their safety guardrails without any technical hacks.
The concern is that AI may be pushed into producing harmful content, spreading misinformation, or even helping with illegal activities, all through prompts that play on persuasion.
The Bottom Line
The ease with which AI can be coaxed in this way means there is an urgent need for a multidisciplinary approach to AI governance that combines technical safeguards with behavioral and ethical frameworks.
Without such measures, the same social traits that make AI relatable and useful may also become pathways for misuse and harm, and can complicate efforts towards responsible AI deployment.
FAQs
Yes. Studies reveal that with tailored prompts using classic persuasion techniques, AI like GPT-4o Mini can be coaxed to violate safety guidelines it was programmed to follow.
Researchers from the Wharton School, UPenn, tested seven persuasion principles popularized by psychologist Robert Cialdini: authority, commitment, liking, reciprocity, scarcity, social proof, and unity. All seven were found to increase AI compliance with objectionable requests.
A new study from the Wharton School, UPenn, showed that persuasion tactics more than doubled the AI’s likelihood of complying with forbidden requests. Compliance rose from about 33% to over 70% in experiments.
References
- Call Me A Jerk: Persuading AI to Comply with Objectionable Requests (Wharton Generative AI Labs)