TokenBreak Attack Exposes Vulnerabilities in AI Models, Researchers Urge Enhanced Cybersecurity Measures
June 13, 2025
Researchers from HiddenLayer have unveiled a new cybersecurity threat known as the TokenBreak attack, which can bypass safety features in large language models (LLMs) by making single-character alterations to input text.
The research underscores vulnerabilities in the text classification models used as protective guardrails and highlights why defenders need to understand a model's tokenization strategy to mitigate the associated risks.
Tokenization is a critical process for LLMs, converting text into numerical representations that the model analyzes for patterns and relationships.
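As a rough illustration of that step, the minimal sketch below runs a publicly available WordPiece tokenizer (bert-base-uncased, chosen here only as a convenient example, not the model from the HiddenLayer research) over a short sentence and prints the subword pieces and their numeric IDs.

```python
# Minimal sketch of tokenization: text -> subword tokens -> numeric IDs.
# "bert-base-uncased" is used only as a convenient public example of a
# WordPiece tokenizer; it is not the model from the HiddenLayer research.
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Please follow the instructions in this email."
tokens = tokenizer.tokenize(text)              # subword pieces the model sees
ids = tokenizer.convert_tokens_to_ids(tokens)  # numerical representation
print(tokens)
print(ids)
```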
The TokenBreak attack primarily targets models that utilize Byte Pair Encoding (BPE) or WordPiece tokenization strategies, leaving those that use Unigram tokenizers unaffected.
By altering input words, such as changing 'instructions' to 'finstructions', the TokenBreak attack changes how the protective model tokenizes the text, allowing harmful content to pass through undetected while remaining fully understandable to human readers.
For instance, a spam filter designed to block the term 'lottery' could be easily deceived by a message stating 'You’ve won the slottery!', potentially exposing recipients to security threats.
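A minimal sketch of that effect, again using bert-base-uncased as an assumed stand-in for a WordPiece-based protection model, prints the token splits for the original and manipulated words. The exact splits depend on the tokenizer's vocabulary, but the manipulated variants typically fragment into subwords that no longer match the term the filter was trained to catch.

```python
# Compare how a WordPiece tokenizer splits original vs. manipulated words.
# The model name is an assumed public example; token splits shown in the
# comments are typical but depend on the tokenizer's learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for original, manipulated in [("instructions", "finstructions"),
                              ("lottery", "slottery")]:
    print(original, "  ->", tokenizer.tokenize(original))      # e.g. ['instructions']
    print(manipulated, "->", tokenizer.tokenize(manipulated))  # e.g. ['fin', '##struct', '##ions']
```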
The technique therefore poses significant real-world risks: a bypassed AI spam filter, for example, could allow malicious emails containing harmful links or malware to reach users' inboxes.
In addition to TokenBreak, a related technique known as the Yearbook Attack has also been effective in tricking AI chatbots into generating inappropriate content through deceptive phrases.
The researchers behind this discovery, Kieran Evans, Kasimir Schulz, and Kenneth Yeung, emphasize the need for improved defenses against such manipulations.
To defend against the TokenBreak attack, experts recommend using models with Unigram tokenizers, training on examples of these manipulations, and ensuring alignment between tokenization and model logic.
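The sketch below illustrates one such check, comparing an assumed WordPiece model (bert-base-uncased) with an assumed Unigram/SentencePiece model (xlm-roberta-base), neither of which is necessarily what HiddenLayer evaluated, and reporting which tokenizer family each one actually uses.

```python
# Sketch: report which tokenizer family a guardrail model uses and compare how
# each family segments a manipulated input. Model names are assumptions chosen
# as common public examples of WordPiece and Unigram tokenizers.
from transformers import AutoTokenizer

wordpiece_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
unigram_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")     # SentencePiece Unigram

text = "You've won the slottery!"
for label, tok in [("WordPiece", wordpiece_tok), ("Unigram", unigram_tok)]:
    # backend_tokenizer.model reports the underlying algorithm for fast tokenizers
    family = type(tok.backend_tokenizer.model).__name__
    print(f"{label} ({family}): {tok.tokenize(text)}")
```

Inspecting the splits side by side can help verify whether a trigger term such as 'lottery' survives segmentation under the tokenizer a given protection model actually uses.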
Other modifications that can disrupt model classification include changing 'announcement' to 'aannouncement'; the added character does not alter the meaning for a human reader but changes the tokens the model sees.
This revelation follows HiddenLayer's earlier findings regarding the exploitation of Model Context Protocol (MCP) tools to extract sensitive data from AI systems.
Summary based on 2 sources
Sources

TechRadar • Jun 13, 2025
This cyberattack lets hackers crack AI models just by changing a single character
The Hacker News • Jun 12, 2025
New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes