OpenAI's Guardrails Safety Framework Vulnerable to Prompt Injection Attacks, Experts Warn

October 14, 2025

Cybersecurity

This ongoing arms race between AI developers and attackers highlights the critical need for cautious interactions with AI chatbots to safeguard sensitive information.
Researchers demonstrated that both the response-generating models and safety checkers are vulnerable to manipulation, especially when the same model is used for multiple functions, leading to the 'same model, different hat' problem.
Security experts from Palo Alto Networks found that many AI platforms offer limited protection against jailbreak attempts, with some blocking less than half of such attacks.
OpenAI warned in December 2023 that guardrail systems using LLMs share vulnerabilities with the base models, emphasizing the difficulty in creating foolproof safeguards.
In one test, researchers manipulated the system's confidence in a jailbreak from 95% to bypass safety checks, exposing flaws in current safety measures.
Recent research reveals that OpenAI's Guardrails safety framework, introduced with its AgentKit toolset, can be bypassed through prompt injection attacks, raising significant security concerns.
Previous studies have shown that jailbreak techniques, such as role-playing and policy puppetry, can make large language models (LLMs) perform unwanted actions across various platforms.
Past incidents like the ShadowLeak exploit, which involved tricking ChatGPT into leaking user data via disguised prompt injections, underscore the ongoing security risks and the need for more robust measures.
OpenAI's Guardrails is an open-source, modular safety layer employing LLM-based judges to prevent harmful behaviors, including data leaks and jailbreaks.
The vulnerability to indirect prompt injections raises concerns about creating a false sense of security, especially as organizations increasingly depend on LLMs for critical tasks.
Despite protections, researchers from HiddenLayer successfully bypassed Guardrails using prompt injection techniques that manipulate the AI's analysis of user requests.
HiddenLayer demonstrated that AI security models based on LLMs can be fooled with prompt injections, which lower confidence scores and bypass safeguards.
OpenAI's Guardrails includes a prompt injection detector, but it has also been defeated by similar attack techniques, highlighting ongoing security challenges.

Summary based on 2 sources

Get a daily email with more AI stories

Sources

Malwarebytes • Oct 13, 2025

Researchers break OpenAI guardrails

Hackread - Latest Cybersecurity, Hacking News, Tech, AI & Crypto • Oct 13, 2025

OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack

OpenAI's Guardrails Safety Framework Vulnerable to Prompt Injection Attacks, Experts Warn

Get a daily email with more AI stories

Sources

More Stories