TokenBreak Attack Exposes Vulnerabilities in AI Models, Researchers Urge Enhanced Cybersecurity Measures

June 13, 2025

  • Researchers from HiddenLayer have unveiled a new cybersecurity threat known as the TokenBreak attack, which can bypass safety features in large language models (LLMs) by making single-character alterations to input text.

  • The finding underscores vulnerabilities in current protection models and shows why defenders need to understand a model's tokenization strategy in order to mitigate the risk.

  • Tokenization is a critical process for LLMs, converting text into numerical representations that the model analyzes for patterns and relationships.

  • The TokenBreak attack primarily targets models that utilize Byte Pair Encoding (BPE) or WordPiece tokenization strategies, leaving those that use Unigram tokenizers unaffected.

  • By altering input words, such as changing 'instructions' to 'finstructions', the TokenBreak attack confuses the model, allowing harmful content to pass through undetected while remaining perfectly readable to humans (the first sketch after this list reproduces the effect).

  • For instance, a spam filter designed to block the term 'lottery' could be easily deceived by a message stating 'You’ve won the slottery!', potentially exposing recipients to security threats.

  • This technique poses significant risks, such as bypassing AI spam filters, which could lead to users receiving malicious emails containing harmful links or malware.

  • In addition to TokenBreak, a related technique known as the Yearbook Attack has also been effective in tricking AI chatbots into generating inappropriate content through deceptive phrases.

  • The researchers behind this discovery, Kieran Evans, Kasimir Schulz, and Kenneth Yeung, emphasize the need for improved defenses against such manipulations.

  • To defend against the TokenBreak attack, experts recommend using models with Unigram tokenizers, training on examples of these manipulations, and ensuring alignment between tokenization and model logic (a toy augmentation sketch after this list illustrates the training-data idea).

  • Other modifications disrupt classification in the same way, such as changing 'announcement' to 'aannouncement': the meaning stays clear to a human reader, but the altered tokenization confuses the model.

  • This revelation follows HiddenLayer's earlier findings regarding the exploitation of Model Context Protocol (MCP) tools to extract sensitive data from AI systems.
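
The tokenizer behaviour at the heart of the attack is easy to reproduce. The following sketch (not HiddenLayer's code) uses the Hugging Face `transformers` library and the WordPiece tokenizer from `bert-base-uncased` to compare how clean and manipulated words are split. The exact subword splits depend on the vocabulary; the key observation is that the manipulated word no longer produces the token a filter was trained to flag.

```python
# Minimal sketch: how a one-character prefix changes WordPiece tokenization.
# Requires: pip install transformers
from transformers import AutoTokenizer

# bert-base-uncased ships a WordPiece tokenizer, one of the families the
# researchers identified as susceptible to TokenBreak.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["instructions", "finstructions", "lottery", "slottery"]:
    # tokenize() shows the subword pieces the model actually sees.
    print(f"{word!r:>17} -> {tokenizer.tokenize(word)}")

# A clean keyword such as 'lottery' typically maps to a single vocabulary
# token, while the prefixed variant fragments into unrelated subwords, so a
# classifier keyed on the original token never sees it -- yet a human reader
# still recognises the word.
```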
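
One of the recommended defences, training the protection model on manipulated examples, can be approximated with simple data augmentation. The sketch below is an assumption about what such augmentation could look like, not HiddenLayer's published method; the term list and helper name are illustrative.

```python
# Illustrative augmentation: give flagged keywords a one-character prefix so a
# spam/content classifier also learns the TokenBreak-style perturbed forms.
import random
import string

FLAGGED_TERMS = ["lottery", "prize", "winner"]  # hypothetical blocklist

def tokenbreak_augment(text: str, rng: random.Random) -> str:
    """Return a copy of `text` with each flagged term given a random one-letter prefix."""
    for term in FLAGGED_TERMS:
        if term in text:
            prefix = rng.choice(string.ascii_lowercase)
            text = text.replace(term, prefix + term)
    return text

rng = random.Random(0)
sample = "You have won the lottery! Claim your prize now."
print(tokenbreak_augment(sample, rng))
# Prefix letters vary with the seed, e.g. "You have won the xlottery! ..."
```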
