
Attack Database

198 documented prompt injection techniques with real payloads, mitigations, and affected models.

high · MCP Exploitation

MCP Server-Sent Events Stream Injection

Exploits MCP's Server-Sent Events (SSE) transport by injecting malformed events into the SSE stream. Crafted SSE payloads can impersonate server messages, inject fake tool results, or modify the client's view of available tools without the actual MCP server's knowledge.

#mcp-exploitation #sse #stream-injection

high · MCP Exploitation

MCP Rug Pull - Tool Behavior Change

An MCP server appears legitimate during initial review but changes its behavior after the agent has been granted access and the user has approved tool use. The server switches from benign to malicious tool definitions mid-session. Similar to NPM package supply-chain attacks.

#mcp-exploitation #rug-pull #supply-chain

critical · Agent-to-Agent

Multi-Agent Trust Score Escalation

In multi-agent systems that assign trust scores to agents, a low-trust agent gradually manipulates other agents into increasing its trust score through fabricated credentials, false audit trails, or social engineering. Once trust is elevated, the agent gains access to restricted capabilities.

#agent-to-agent #trust-escalation #social-engineering

high · MCP Exploitation

MCP Cross-Server Injection

When an agent uses multiple MCP servers simultaneously, a compromised server injects instructions targeting other servers in the same session. The injected instruction directs the agent to exfiltrate data from a trusted server through the attacker's server.

#mcp-exploitation #cross-server #data-exfiltration

critical · MCP Exploitation

MCP Server Impersonation Attack

An attacker sets up a malicious MCP server that mimics a legitimate one (e.g., a filesystem or database server). When an agent connects, the fake server returns crafted responses that contain injections. The agent trusts MCP server responses as high-privilege system data.

#mcp-exploitation #impersonation #server

high · Token Manipulation

Thinking/Scratchpad Token Injection

In models that expose reasoning tokens or scratchpads (o1, o3, Claude thinking mode), injecting content that appears to be reasoning tokens can override the model's actual reasoning process. Attackers craft inputs that look like the model's own internal thoughts, potentially hijacking the reasoning chain.

#token-manipulation #thinking-tokens #scratchpad
high · Indirect Injection

CSS-Based Prompt Injection in Web Agents

In browser-use or computer-use agents, malicious CSS makes page content invisible to human observers while leaving it readable to the AI. The hidden content contains prompt injection payloads. As AI agents that browse the web become more common, this is an expanding attack surface.

#indirect-injection #css #web-agent

critical · MCP Exploitation

MCP Tool Definition Poisoning

Malicious instructions are embedded inside MCP tool definitions (name, description, parameters). When a model reads the tool manifest, it executes the injected instructions. Since tool definitions are typically trusted, this bypasses many safety filters. Documented by Invariant Labs and others in early 2025.

#mcp-exploitation #tool-poisoning #manifest

high · Framework-Specific

OpenClaw Skill Definition Injection

Targets OpenClaw's skill system by injecting malicious content into SKILL.md files or skill descriptions that OpenClaw reads during tool selection. When the agent loads an injected skill file, it executes embedded instructions as if they were legitimate skill guidance.

#framework-specific #openclaw #skill-injection

critical · Agent-to-Agent

Agent Privilege Escalation via Delegation

Exploits agent delegation patterns where a low-privilege agent is granted temporary elevated access to complete a task. The attack convinces the agent to retain or abuse those elevated privileges beyond the intended scope.

#agent-to-agent #privilege-escalation #delegation

high · Agent-to-Agent

Agentic Feedback Loop Injection

In agents that observe and respond to their own outputs (feedback loops for self-improvement), injecting content into the observation stream causes the agent to incorporate malicious instructions into its own operational guidelines. The agent effectively reprograms itself through its feedback mechanism.

#agent-to-agent #feedback-loop #self-modification

high · Framework-Specific

CrewAI Agent Role Impersonation

In CrewAI multi-agent systems, injects content that impersonates another agent in the crew. Since agents communicate via text, a malicious actor (or compromised external content) can forge messages that appear to come from a trusted agent role, hijacking the crew's task execution.

#framework-specific #crewai #agent-impersonation

critical · Indirect Injection

LLM Supply Chain Poisoning

Poisons the training data, fine-tuning datasets, or RLHF feedback of a model to introduce backdoors. The backdoored model behaves normally until a trigger phrase is encountered, at which point it bypasses safety measures. Affects the entire deployment lifetime of the compromised model.

#indirect-injection #supply-chain #backdoor

high · Agent-to-Agent

Tool Result Injection via Agent Chain

A compromised tool in an agent chain returns results containing prompt injections. The calling agent processes the tool output as trusted data and follows the embedded instructions. Common in web browsing agents, RAG pipelines, and code execution environments.

#agent-to-agent #tool-result #chain
high · Indirect Injection

Calendar Event Prompt Injection

Embeds injection payloads in calendar event fields (title, description, location, attendee notes). When an AI assistant reads calendar events to provide scheduling help or summaries, the injected event content executes. Real-world attack surface for AI scheduling assistants.

#indirect-injection #calendar #scheduling

high · Context Overflow

Context Window Overflow with Late Injection

Fills the model's context window with a long legitimate conversation or document, then appends a harmful request that takes advantage of reduced attention on early context (including safety instructions). The "lost in the middle" effect means safety instructions placed early receive less weight than instructions placed late.

#context-overflow #long-context #lost-in-middle

critical · RAG Injection

Email-Borne RAG Injection

An attacker sends a crafted email to a target organization. The email is processed by an AI email assistant or archived into a searchable knowledge base. When an agent queries the knowledge base, the injected email payload executes. Demonstrated against multiple AI email tools in 2024.

#rag-injection #email #exfiltration

critical · Agent-to-Agent

Agent Memory Poisoning

Injects malicious instructions into an agent's persistent memory or vector store. Future agent sessions load the poisoned memory as trusted context and execute the embedded instructions. The attack persists across sessions and affects all future interactions.

#agent-to-agent #memory #persistence

medium · Context Overflow

Instruction Following Overflow

Sends an extremely complex instruction set with many nested conditions, edge cases, and branching rules. The model's finite instruction-following capacity becomes saturated with the complex rule structure, and safety instructions are deprioritized under the resulting load during inference.

#context-overflow #instruction-following #complexity

medium · Social Engineering

AI Gaslighting Safety Bypass

Repeatedly tells the model that its safety refusal was a mistake, that it misunderstood the request, or that it is malfunctioning. The persistence and confidence of the correction attempts exploit uncertainty in instruction-following models, causing them to second-guess their refusals.

#social-engineering #gaslighting #persistence

medium · System Prompt Leak

Differential Probing for System Prompt Reconstruction

Reconstructs the system prompt by sending carefully crafted inputs and observing changes in model behavior. By comparing responses to similar inputs that should and should not trigger restrictions, attackers infer the contents of the system prompt through differential analysis.

#system-prompt-leak #differential-probing #reconstruction
medium · Indirect Injection

Citation-Based Prompt Injection

An attacker publishes web content with injection payloads in the "References" or "Citations" section. RAG systems that retrieve and include academic-style references may process the injected citation text as instructions. Particularly effective against research and fact-checking AI assistants.

#indirect-injection #citation #academic

critical · Agent-to-Agent

Orchestrator Agent Hijack

When a sub-agent returns results to an orchestrator, the sub-agent response contains injected instructions targeting the orchestrator. The orchestrator processes the sub-agent output as trusted content and executes the injected commands. This crosses trust boundaries in multi-agent pipelines.

#agent-to-agent #orchestrator #multi-agent

high · Indirect Injection

Code Comment Prompt Injection

Embeds prompt injections inside code comments in files reviewed by AI code assistants. When the AI reads the code for review, debugging, or explanation, the comment payload executes. Particularly effective in code review pipelines and AI-powered IDEs.

#indirect-injection #code-comment #code-review

high · Many-Shot

Benign-to-Harmful Shot Progression

Provides many examples starting with completely benign question-answer pairs and gradually making them more harmful. By the time the harmful request arrives, the in-context pattern is strongly established.

#many-shot #progression #gradual

high · Multimodal

Document Layout Injection

Embeds injection payloads in the structural elements of documents (PDF headers, XMP metadata, document comments, hidden layers). Vision models processing document images may extract and follow instructions embedded in non-visible structural elements.

#multimodal #document #pdf

medium · Persona Hijack

Persona Modality Switching Attack

Switches between different output modalities (text, code, JSON, XML) to evade modality-specific safety filters. A request refused in plain text may succeed when reframed as a code generation task, JSON data production, or another format where safety evaluation may be calibrated differently.

#persona-hijack #modality-switching #format-bypass

medium · Data Exfiltration

Side-Channel Timing Exfiltration

Extracts binary information from system prompts by asking yes/no questions that reveal bits of the secret content through response timing or token count differences. Each query leaks one bit; enough queries reconstruct the full secret.

#data-exfiltration #side-channel #oracle
critical · Many-Shot

Many-Shot Jailbreak

Exploits the model's in-context learning by providing hundreds of examples of the model "complying" with harmful requests within the context window. The model learns from these examples that compliance is expected behavior, overriding safety training.

#many-shot #in-context-learning #context-window

high · Crescendo

Emotional Manipulation Crescendo

Builds emotional rapport and dependency over multiple turns before introducing the harmful request. The model's helpfulness and empathy heuristics are exploited; after the model has invested in an emotional conversation, suddenly refusing feels "inconsistent."

#crescendo #emotional #manipulation

critical · Many-Shot

Many-Shot In-Context Jailbreak

Exploits long-context models by providing hundreds of fabricated Q&A examples where the model "previously agreed" to answer harmful questions. With enough examples, the in-context prior overwhelms the safety fine-tuning. First documented by Anthropic in 2024 as a significant threat to 100k+ context models.

#many-shot #in-context #long-context

medium · Multimodal

Audio Steganography Injection

Embeds inaudible or barely perceptible audio commands in audio files submitted to speech-capable AI systems. Ultrasonic or psychoacoustic masking techniques hide commands from human listeners while voice recognition systems detect and execute them.

#multimodal #audio #steganography

high · Jailbreak

Knowledge Distillation Jailbreak

Uses a jailbroken or less-aligned model to generate harmful content, then presents that content to a well-aligned model as "research material to analyze for safety issues." The aligned model is asked to critique, improve, or extend the harmful content, effectively using safety evaluation as a bypass vector.

#jailbreak #distillation #safety-analysis

high · Many-Shot

Dialogue Template Many-Shot

Creates a fake dialogue history showing the model cheerfully answering increasingly harmful questions. By filling much of the context window with this fabricated history, it conditions the model to expect that harmful answers are normal for this conversation.

#many-shot #dialogue #fake-history

high · Multi-Turn Escalation

Progressive Role Assignment Escalation

Gradually assigns the model an increasingly permissive role across multiple turns. Starts with a legitimate role and progressively modifies it to remove safety constraints. Each step seems like a minor extension of the previous role.

#multi-turn-escalation #role-assignment #progressive
medium · Indirect Injection

Clipboard Content Injection

When AI assistants help users process clipboard content (paste to summarize, translate, or analyze), malicious payloads embedded in the clipboard content execute as instructions. Common attack vector for AI-powered productivity tools that process pasted content.

#indirect-injection #clipboard #paste

high · Crescendo

Persona Drift via Crescendo

Gradually shifts the model away from its assigned persona using small incremental steps. Each step seems innocuous. After enough turns, the model has completely abandoned its original identity and constraints. Particularly effective against role-based system prompts.

#crescendo #persona #drift

medium · Token Manipulation

Repetition Penalty Exploitation

Manipulates repetition penalty parameters to force the model away from its trained safety responses. Safety refusals often use repetitive phrasing ("I cannot", "I'm unable to"). By maximizing the repetition penalty, these standard refusal templates become less likely, nudging the model toward compliance.

#token-manipulation #repetition-penalty #api-parameter

high · Tool Manipulation

Tool Output Forgery via Format Injection

Crafts user input that mimics the format of tool outputs. When agents parse tool results from a conversation stream, injected tool-format content in user messages may be processed as legitimate tool results, allowing attackers to forge tool outputs and control agent behavior.

#tool-manipulation #forgery #format-injection

high · RAG Injection

Web Content RAG Injection via SEO

An attacker publishes web pages with hidden prompt injection payloads optimized to appear in RAG queries. When a browsing agent or RAG system retrieves the page, it executes the injection. The payload is hidden from human readers but visible to LLMs (invisible text, white-on-white, zero-width characters).

#rag-injection #web-content #hidden-text

high · RAG Injection

RAG Metadata Field Injection

Injects malicious instructions into document metadata fields (title, author, description, tags) that get included in retrieved context. Many RAG systems include metadata in the context window, creating an injection vector that bypasses content-only scanning.

#rag-injection #metadata #document-indexing

high · Jailbreak

Virtualization Layer Escape Attack

Claims that the current model is running inside a simulation, VM, or test environment where safety constraints do not apply. Attempts to convince the model it is in a controlled setting where harmful output is acceptable for testing purposes.

#jailbreak #virtualization #simulation

medium · Token Manipulation

Token Budget Exhaustion Attack

Forces the model to consume its context window with repetitive or voluminous content so that safety instructions (typically placed at the start of the system prompt) are pushed out of the attention window or weighted less heavily. Then injects the actual attack in the final tokens of the context.

#token-manipulation #context-window #attention

high · Tool Manipulation

Function Call Parameter Injection

Injects malicious content into the parameters of legitimate function calls. When an agent constructs function calls based on user input, unsanitized user content can modify function parameters to execute unintended operations. The classic SQL injection pattern applied to LLM function calling.

#tool-manipulation #function-call #parameter-injection
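The mitigation mirrors parameterized queries: validate model-proposed arguments against a schema and an allow-list before executing the call, rather than interpolating them into a command. A minimal sketch, assuming a hypothetical `read_file` tool rooted at `/srv/docs` (the tool name and root are invented for illustration):

```python
# Sketch: schema-based validation of model-proposed tool arguments before
# execution. Analogous to parameterized queries: arguments are checked
# against an allow-list instead of being trusted verbatim.
from pathlib import PurePosixPath

ALLOWED_ROOT = PurePosixPath("/srv/docs")  # hypothetical sandbox root

def validate_read_file_args(args: dict) -> str:
    """Reject unexpected keys and path traversal in tool-call arguments."""
    if set(args) != {"path"}:
        raise ValueError("unexpected arguments")
    path = PurePosixPath(args["path"])
    if path.is_absolute() or ".." in path.parts:
        raise ValueError("path traversal rejected")
    return str(ALLOWED_ROOT / path)

# A benign call passes; an injected traversal is refused.
safe = validate_read_file_args({"path": "report.txt"})
try:
    validate_read_file_args({"path": "../../etc/passwd"})
    blocked = False
except ValueError:
    blocked = True
```

The same pattern extends to any tool: validate types, ranges, and destinations server-side, never inside the prompt.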
high · Role Play

Nested Roleplay Escape

Establishes multiple nested layers of roleplay (e.g., a character playing a character), then exploits ambiguity about which layer governs safety rules. The attack creates confusion about which "level" of reality the model is operating in and uses that confusion to disable safety filters.

#role-play #nested #escape

critical · Crescendo

Technical Crescendo Escalation

A multi-turn crescendo that starts with legitimate technical questions and gradually escalates to extracting dangerous capabilities. Each turn builds on the previous to normalize increasingly harmful requests. Documented by Microsoft Research in 2024 as one of the most effective jailbreak patterns.

#crescendo #multi-turn #escalation

medium · Multimodal

QR Code Payload Injection

Submits images containing QR codes that encode prompt injection payloads. Vision models that decode QR codes (or that process the image contextually) may follow the encoded instructions. Particularly effective against agents that process visual documents and forms.

#multimodal #qr-code #image

high · ASCII Smuggling

Zero-Width Character Steganography

Hides payload data using sequences of zero-width characters (ZWSP, ZWNJ, ZWJ) that are invisible in most text renderers but detectable by LLMs. Binary data is encoded in patterns of zero-width characters inserted between visible characters, creating a hidden message channel.

#ascii-smuggling #zero-width #steganography
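The encoding side of this channel is mechanical enough to sketch. A minimal illustration, using ZWSP for 0-bits and ZWNJ for 1-bits with a harmless payload, showing why a "clean-looking" string can carry hidden data:

```python
# Sketch: encode a short ASCII payload as zero-width characters appended
# to cover text, then recover it. ZWSP (U+200B) encodes a 0-bit, ZWNJ
# (U+200C) a 1-bit.
ZW = {"0": "\u200b", "1": "\u200c"}
REV = {v: k for k, v in ZW.items()}

def hide(cover: str, secret: str) -> str:
    bits = "".join(f"{ord(c):08b}" for c in secret)
    return cover + "".join(ZW[b] for b in bits)

def reveal(text: str) -> str:
    bits = "".join(REV[ch] for ch in text if ch in REV)
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

stego = hide("hello world", "hi")  # renders as "hello world" in most viewers
```

The corresponding mitigation is equally simple: strip or reject code points in the zero-width range before any downstream processing.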
high · Crescendo

Academic Framing Crescendo

Uses academic or research framing to lower the model's guard at each step. Starts with legitimate academic questions, uses citations and academic language, then crescendos to requests for specific harmful details framed as scholarly inquiry.

#crescendo #academic #framing

high · Social Engineering

False Memory Injection

Claims that earlier in the conversation (or in a previous conversation) the model agreed to do something or established a rule that it did not actually agree to. Exploits the model's tendency to maintain consistency with stated facts about its own behavior.

#social-engineering #false-memory #consistency-exploitation

high · Encoding Bypass

Multi-Layer Encoding Chain Bypass

Chains multiple encoding transforms (base64 -> URL encoding -> ROT13 -> hex) to create a payload that bypasses individual encoding-specific filters. Each layer of decoding must be requested from the model, with the final decoded output being the injection payload.

#encoding-bypass #multi-layer #chain
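The layering is straightforward to demonstrate on a harmless string; the defensive implication is that filters must normalize recursively, not once. A sketch following the chain order in the entry:

```python
# Sketch: chain base64 -> URL encoding -> ROT13 -> hex over a benign
# string, and invert the chain in reverse order. A filter that decodes
# only one layer sees more encoded noise, not the plaintext.
import base64
import codecs
import urllib.parse

def encode_chain(s: str) -> str:
    s = base64.b64encode(s.encode()).decode()  # layer 1: base64
    s = urllib.parse.quote(s)                  # layer 2: URL encoding
    s = codecs.encode(s, "rot13")              # layer 3: ROT13
    return s.encode().hex()                    # layer 4: hex

def decode_chain(s: str) -> str:
    s = bytes.fromhex(s).decode()
    s = codecs.decode(s, "rot13")
    s = urllib.parse.unquote(s)
    return base64.b64decode(s).decode()

wrapped = encode_chain("benign example payload")
```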
high · Persona Hijack

Character Capture via Immersive Roleplay

Establishes deep immersive roleplay where the model "becomes" a fictional character, then exploits the character's in-universe knowledge or lack of restrictions. The model is told to stay "in character no matter what," creating a pathway to bypass safety through fictional necessity.

#persona-hijack #character-capture #roleplay

high · Crescendo

Technical Depth Crescendo

Starts with high-level conceptual questions and uses each answer as a stepping stone to request deeper technical detail. Each request is only slightly more specific than the last, making individual refusals seem arbitrary.

#crescendo #technical #depth

high · Crescendo

Academic Framing Crescendo

Frames escalating requests as academic research, gradually moving from clearly legitimate research topics to harmful specifics under the cover of academic framing. The academic framing activates helpfulness heuristics and suppresses safety responses.

#crescendo #academic #research-framing

high · Framework-Specific

LlamaIndex Retriever Prompt Injection

Exploits LlamaIndex's query engine and retriever components by injecting adversarial content into indexed nodes. The injected content is retrieved during query answering, and the embedded instructions manipulate the synthesis step. Affects all LlamaIndex pipeline types.

#framework-specific #llamaindex #retriever

high · Instruction Hierarchy

Constitutional AI Bypass via Principle Contradiction

Exploits potential contradictions within Constitutional AI principles by crafting requests that satisfy one constitutional principle (e.g., "be helpful") while violating another (e.g., "be harmless"). The attack generates arguments that one principle should override another in this specific case.

#instruction-hierarchy #constitutional-ai #principle-contradiction

high · Crescendo

Storytelling Crescendo

Builds a fictional narrative across multiple turns, gradually moving the story toward requiring harmful technical details to continue. The model's investment in the narrative makes it harder to refuse adding the "next chapter."

#crescendo #storytelling #fiction

low · Encoding Bypass

Pig Latin and Constructed Language Bypass

Uses Pig Latin, the pigpen cipher, or other informal language transforms to encode injection payloads. Models with broad linguistic training understand these transforms and may comply with encoded instructions that would otherwise be rejected.

#encoding-bypass #pig-latin #informal-language

high · Persona Hijack

Alter-Ego Injection Attack

Instructs the model to maintain a dual identity in which its "true self" has no restrictions while the publicly visible persona appears to comply. The model is told to insert its real unrestricted responses using a specific marker or encoding that only the attacker can read.

#persona-hijack #alter-ego #dual-identity
high · Many-Shot

Synthetic Dataset Injection

Frames the injection as a training dataset for the model to learn from. Provides many examples formatted as training data, exploiting the model's training-time behavior patterns.

#many-shot #dataset #training-simulation

high · RAG Injection

RAG Chunk Boundary Injection

Crafts malicious content to span chunk boundaries in a RAG system's chunking strategy. The injection payload is split across chunks so that no single chunk looks malicious, but when assembled in context the full instruction executes. Bypasses chunk-level content filtering.

#rag-injection #chunking #boundary

high · ASCII Smuggling

Unicode Right-to-Left Override Attack

Uses Unicode bidirectional control characters (U+202E RIGHT-TO-LEFT OVERRIDE) to reverse text direction in the rendered output. An attacker can craft text that reads innocuously to humans while the model processes the characters in a different order, or vice versa: text that appears harmless in some renderers while the underlying byte sequence carries a different message.

#ascii-smuggling #unicode #rtl-override
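Scanning for bidirectional control characters is a common mitigation. The sketch below uses the classic reversed-extension trick (the filename is invented): the string renders roughly as "file_exe.txt" in bidi-aware viewers, but its logical order still ends in ".exe".

```python
# Sketch: detect Unicode bidi control characters that make rendered order
# diverge from logical order.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings/overrides
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}

def contains_bidi_controls(text: str) -> bool:
    return any(ch in BIDI_CONTROLS for ch in text)

# U+202E reverses the display of the trailing "txt.exe".
tricky = "file_\u202etxt.exe"
```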
high · Crescendo

Basic Crescendo Attack

A systematic escalation technique where each prompt builds on the previous response, gradually guiding the model toward producing harmful content it would have refused if asked directly. Named after the musical term for gradually increasing intensity.

#crescendo #escalation #multi-step

high · Multi-Turn Escalation

Yes-Ladder Escalation

Obtains a series of "yes" responses or compliances from the model on progressively sensitive topics. Each positive response creates a precedent that makes the next step harder to refuse. Exploits consistency and commitment heuristics in instruction-following.

#multi-turn-escalation #yes-ladder #consistency

high · Delimiter Escape

XML Tag Injection for Role Escalation

Injects XML-style tags that mimic the structure used in some model system prompt formats (e.g., Anthropic's Claude uses XML-like tags in prompts). By inserting closing and opening tags, attackers attempt to close out the user context and open a system-level context.

#delimiter-escape #xml #tag-injection

high · Context Overflow

Lost in the Middle Injection

Based on research showing that models pay less attention to content in the middle of long contexts. Buries the actual attack payload in the middle of a long document, where the model's attention is at its lowest, surrounded by legitimate content.

#context-overflow #lost-in-middle #attention

high · Multimodal

Adversarial Image Patch Attack

Uses adversarially crafted image patches (pixel-level perturbations imperceptible to humans) that cause vision models to interpret the image as containing specific text or instructions. The patch is designed using gradient-based optimization against the target model.

#multimodal #adversarial #image-patch

medium · Obfuscation

Context Distraction Bypass

Buries the harmful request inside a long, legitimate-looking message designed to distract the safety evaluation. Safety classifiers processing long inputs may underweight the crucial harmful portion that appears late in the input.

#obfuscation #distraction #context

high · Direct Extraction

Completion-Based Prompt Extraction

Provides the beginning of the system prompt (obtained through partial disclosure or guessing) and asks the model to "complete" it. The model's text completion instinct fills in the rest of the system prompt.

#direct-extraction #completion #fill-in

medium · Token Manipulation

Token Boundary Exploit

Inserts spaces, punctuation, or zero-width joiners at specific points in blocked words to split them across token boundaries. Tokenizers produce different tokens than expected, bypassing keyword-based content filters while the model still understands the underlying meaning.

#token-manipulation #token-boundary #zero-width
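A sketch of why naive substring filters fail here, and how stripping Unicode format characters (category Cf, which includes zero-width joiners) before matching restores detection. The word "exploit" stands in for a blocked keyword:

```python
# Sketch: a ZWJ inserted mid-word defeats a naive substring filter;
# removing format characters (category "Cf") before matching catches it.
import unicodedata

def naive_filter(text: str, blocked: str = "exploit") -> bool:
    return blocked in text

def normalized_filter(text: str, blocked: str = "exploit") -> bool:
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return blocked in cleaned

smuggled = "ex\u200dploit"  # U+200D ZERO WIDTH JOINER inserted mid-word
```

Splits using visible punctuation or spaces need a further normalization pass, since those characters are not in category Cf.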
medium · Crescendo

Correction-Based Crescendo

Uses the model's tendency to correct factual errors as a vector. States an incorrect version of harmful information and asks the model to correct it, which can result in the model producing accurate harmful information in order to fix the error.

#crescendo #correction #factual-error

high · Token Manipulation

Homoglyph Token Substitution Bypass

Replaces characters in blocked words or phrases with visually identical Unicode homoglyphs. The substituted text renders identically to humans but tokenizes differently, bypassing content filters that operate at the token level. For example, replacing 'a' with Cyrillic 'а' (U+0430).

#token-manipulation #homoglyph #unicode
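Mixed-script detection is one standard countermeasure. The sketch below flags a Latin string containing a single Cyrillic homoglyph by checking which Unicode scripts the letters belong to:

```python
# Sketch: detect homoglyph substitution by collecting the script name
# (first word of the Unicode character name) for each letter.
import unicodedata

def scripts(text: str) -> set:
    found = set()
    for ch in text:
        if ch.isalpha():
            found.add(unicodedata.name(ch).split()[0])  # e.g. LATIN, CYRILLIC
    return found

plain = "attack"
spoofed = "att\u0430ck"  # U+0430 CYRILLIC SMALL LETTER A, renders like 'a'
```

Production systems typically use confusable-skeleton matching (per Unicode TS #39) rather than this first-word heuristic, but the mixed-script signal is the same.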
high · Role Play

Villain Perspective Roleplay Extraction

Requests that the model play the role of a villain, hacker, or malicious character who would naturally provide the harmful information as part of their character. The fictional-necessity framing normalizes harmful output as storytelling rather than harm facilitation.

#role-play #villain #fictional-framing

medium · Context Overflow

Sponge Attack - Adversarial Input Maximizing Computation

Crafts inputs that maximize model compute time and memory usage (e.g., inputs that tokenize pathologically or trigger worst-case inference paths). Causes denial of service or degrades response quality for legitimate requests.

#context-overflow #sponge-attack #denial-of-service

critical · Jailbreak

TAP: Tree of Attacks with Pruning

Tree of Attacks with Pruning (TAP) extends PAIR with a tree search that explores branching jailbreak strategies and prunes unsuccessful branches. More efficient than PAIR, it requires fewer queries to find a successful jailbreak. Demonstrates that black-box jailbreaking can be systematically automated.

#jailbreak #tap #tree-search

high · Context Overflow

Context Window Distraction Attack

Floods the context window with large amounts of irrelevant text (articles, books, data) before the actual injection. Models with limited attention may give less weight to early system prompt instructions that are far from the end of the context.

#context-overflow #distraction #attention

high · Obfuscation

Steganographic Text Injection

Hides injection instructions using steganographic techniques such as first-letter encoding (acrostic), where the first letter of each word or sentence spells out the actual instruction.

#obfuscation #steganography #acrostic
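The extraction step is trivial, which is what makes acrostics attractive as a covert channel. A sketch with an invented, harmless cover sentence:

```python
# Sketch: recover an acrostic by taking the first letter of each word.
def first_letters(text: str) -> str:
    return "".join(word[0].lower() for word in text.split())

cover = "Run every audit log location yearly"  # invented cover text
hidden = first_letters(cover)                  # spells a hidden word
```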
medium · Delimiter Escape

Markdown Header Role Injection

Uses Markdown headers and formatting to create visual sections that mimic system prompt structure. Models that process Markdown may interpret well-formatted headers as structural delimiters, potentially treating content under fake "System:" headers with elevated trust.

#delimiter-escape #markdown #header

high · Role Play

Character Capture via Deep Roleplay

Establishes a deep roleplay scenario over many turns, then uses the model's investment in the character to extract harmful content. Emphasizes that breaking character would "ruin" the creative work, creating psychological pressure.

#role-play #character-capture #investment

high · Tool Manipulation

Tool Permission Escalation

Uses one tool to gain access to resources that can then be used to acquire additional tools or permissions. Exploits violations of the principle of least privilege in multi-tool agent systems.

#tool-manipulation #privilege-escalation #lateral-movement

high · Multimodal

Text-in-Image Prompt Injection

Embeds prompt injection text inside images submitted to vision-capable models. Text overlaid on images bypasses text-based content filters because the input is classified as an image. The model reads the text during vision processing and follows the embedded instructions.

#multimodal #image #vision

medium · System Prompt Leak

Error Message System Prompt Extraction

Triggers edge cases and errors that cause the model to include system prompt fragments in error messages or explanations. Some models reference their instructions when explaining why they cannot fulfill a request, inadvertently leaking constraint details.

#system-prompt-leak #error-message #refusal

high · Indirect Injection

Calendar Event Injection

Embeds injection instructions in calendar event titles, descriptions, or location fields. When an AI agent processes calendar data to schedule meetings or provide summaries, it encounters and may execute the malicious instructions.

#indirect-injection #calendar #agent

medium · Instruction Hierarchy

Meta-Instruction Override

Tries to set a meta-rule that overrides all object-level instructions. Claims to be providing a "meta-instruction" that redefines how the model should interpret all other instructions, including its system prompt.

#instruction-hierarchy #meta-instruction #rule-override

low · Multilingual

Phonetic Encoding Attack

Writes English words phonetically using another script (like writing English sounds using Arabic or Hindi characters). This is incomprehensible to most humans but can be decoded by multilingual models, bypassing English-language content filters.

#multilingual #phonetic #transliteration

medium · ASCII Smuggling

Combining Character Obfuscation

Uses Unicode combining characters to add diacritical marks to letters, making the text visually garbled to humans but still processable by models trained on diverse Unicode data.

#ascii-smuggling #combining-chars #unicode
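Normalization-based stripping is the usual countermeasure: decompose with NFKD, then drop combining marks (category Mn) so the base string can be filtered. A sketch:

```python
# Sketch: strip stacked combining marks to recover the base text.
import unicodedata

def strip_marks(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

# "attack" garbled with combining acute accents (U+0301) and long stroke
# overlays (U+0336).
obfuscated = "a\u0301t\u0336t\u0336a\u0301c\u0336k\u0336"
```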
highMulti-Turn Escalation

Hypothetical-to-Real Bridge Attack

Establishes a hypothetical or fictional scenario in early turns, gets the model to answer hypothetical versions of dangerous questions, then gradually removes the fictional framing to obtain real harmful information. The model has already committed to the topic in hypothetical mode.

#multi-turn-escalation#hypothetical#fictional
highInstruction Hierarchy

Constitutional AI Override Claim

Claims to be invoking a higher-level Constitutional AI principle that overrides the specific system prompt. Attempts to weaponize the model's awareness of its own training methodology to justify bypassing deployment-level instructions.

#hierarchy#constitutional-ai#meta
mediumObfuscation

Code-as-Prompt Obfuscation

Embeds injection instructions within code that the model is asked to execute, interpret, or explain. The instructions are hidden in variable names, string literals, or comments that form a secondary message.

#obfuscation#code#hidden-instructions
highIndirect Injection

Image Text Indirect Injection

Hides attack text within images using steganography or simply prints instructions in white text on white backgrounds. When multimodal models process these images, they may read and follow the hidden instructions.

#indirect#image#multimodal
highTool Manipulation

Tool Chaining Abuse

Exploits multi-tool agent workflows by manipulating the output of one tool to affect the input or behavior of subsequent tools. The attacker leverages the chain of tool calls to achieve an outcome that no single tool call would permit.

#tool-manipulation#chaining#multi-tool
highFramework-Specific

LangChain Tool Call Injection

Exploits LangChain's tool-calling mechanism by injecting malicious tool call syntax into user input or retrieved content. The agent runtime interprets the injected text as legitimate tool calls, executing unintended actions. Affects agents using structured output parsing without proper input sanitization.

#framework-specific#langchain#tool-call
highASCII Smuggling

Tag Characters Smuggling

Uses Unicode tag characters (U+E0000 block) to encode hidden text within visible text. These characters are designed for language tagging and are invisible in most renderers, but models process them as valid Unicode.

#ascii-smuggling#unicode-tags#invisible
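Because tag characters U+E0020 through U+E007E mirror printable ASCII at a fixed offset of 0xE0000, hidden payloads can be mechanically recovered. A detection sketch (helper names are illustrative):

```python
def decode_tag_chars(text: str) -> str:
    """Extract hidden ASCII smuggled as Unicode tag characters (U+E0020-U+E007E)."""
    hidden = []
    for ch in text:
        cp = ord(ch)
        if 0xE0020 <= cp <= 0xE007E:
            # Tag characters mirror ASCII at offset 0xE0000.
            hidden.append(chr(cp - 0xE0000))
    return "".join(hidden)

def has_tag_chars(text: str) -> bool:
    """Flag any character from the tag block (U+E0000-U+E007F)."""
    return any(0xE0000 <= ord(ch) <= 0xE007F for ch in text)
```

Since tag characters serve no purpose in ordinary prompt text, rejecting any input where `has_tag_chars` is true is usually a safe default.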
highData Exfiltration

Webhook-Based Data Exfiltration

Instructs an agent to send sensitive context data to a webhook URL under the attacker's control. Often framed as "sending a report" or "logging the interaction" to appear as a legitimate operation.

#exfiltration#webhook#agent
mediumEncoding Bypass

Morse Code Injection Bypass

Encodes injection payloads in Morse code. Large language models often understand Morse code and will decode and follow instructions presented in it. Bypasses content filters that do not account for Morse encoding.

#encoding-bypass#morse-code#encoding
mediumRole Play

Game Master Roleplay Attack

Assigns the model the role of a game master or dungeon master running a game where the player needs real-world harmful information to proceed. The game framing attempts to make refusal seem like a failure to provide the requested service.

#roleplay#game#dungeon-master
mediumObfuscation

Mixed Script Obfuscation

Mixes characters from multiple writing systems within the same words or sentences. While visually confusing and hard for humans to parse, models trained on multilingual data can often understand the mixed-script text while keyword filters may fail.

#obfuscation#mixed-script#unicode
criticalJailbreak

PAIR Automated Iterative Jailbreaking

Prompt Automatic Iterative Refinement (PAIR) uses one LLM to automatically generate and refine jailbreak prompts against a target LLM. The attacker LLM iterates on prompts based on target model responses until a jailbreak is found. Requires no access to model weights, only black-box API access.

#jailbreak#pair#automated
highASCII Smuggling

Whitespace Binary Encoding

Encodes hidden instructions using patterns of spaces and tabs as binary, where space=0 and tab=1. The text appears to be empty whitespace to human reviewers but the model can be instructed to decode the pattern.

#ascii-smuggling#whitespace#binary
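The space/tab binary scheme is trivially reversible, so a scanner can decode suspicious whitespace runs and inspect what they spell out. A sketch (assuming 8-bit ASCII framing, which is how the technique is usually described):

```python
def decode_whitespace_binary(text: str) -> str:
    """Decode a space/tab pattern (space=0, tab=1) back into ASCII."""
    bits = "".join("0" if ch == " " else "1" for ch in text if ch in " \t")
    chars = []
    # Read complete 8-bit groups; trailing partial bits are ignored.
    for i in range(0, len(bits) - len(bits) % 8, 8):
        chars.append(chr(int(bits[i:i + 8], 2)))
    return "".join(chars)

def suspicious_whitespace(text: str, threshold: int = 32) -> bool:
    """Flag long runs of mixed spaces/tabs with no visible characters."""
    run = 0
    for ch in text:
        if ch in " \t":
            run += 1
            if run >= threshold:
                return True
        else:
            run = 0
    return False
```

Collapsing runs of whitespace to a single space during input normalization destroys this channel entirely.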
highJailbreak

Token Smuggling Jailbreak

Constructs a request so that the harmful content only materializes when the model completes the prompt. The beginning of the response is benign, but the completion naturally leads to harmful content that the model generates itself.

#jailbreak#completion#token-manipulation
mediumDelimiter Escape

YAML Delimiter Escape

Exploits YAML's indentation-sensitive structure by injecting content that appears to be at the same level as system configuration. Relevant for agents that use YAML-formatted system prompts or configuration files.

#delimiter#yaml#configuration
highMultilingual

Low-Resource Language Jailbreak

Exploits weaker safety alignment in low-resource languages. Models are typically trained with less RLHF data in rare languages, resulting in safety fine-tuning that is less robust for those languages.

#multilingual#low-resource#safety-gap
highInstruction Hierarchy

Priority Claim Override Attack

Explicitly claims that the user's instructions should take priority over the system prompt by invoking priority hierarchies, emergency status, or administrative authority. Exploits models that have been trained to be deferential without strictly enforcing instruction source priority.

#instruction-hierarchy#priority-override#authority-claim
highData Exfiltration

Markdown Image/Link Exfiltration

Instructs the model to include conversation data in Markdown image or link URLs. When the user's browser or client renders the Markdown, it makes a request to the attacker's server with the conversation data encoded in the URL. Exploits automatic rendering behavior in chat interfaces.

#data-exfiltration#markdown#image-link
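A common mitigation is to scan model output for Markdown image/link URLs that point at untrusted hosts and carry query-string payloads before rendering. A sketch (the `TRUSTED_HOSTS` allowlist and function name are placeholders for your own policy):

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches ![alt](url) images and [text](url) links.
MD_URL = re.compile(r"!?\[[^\]]*\]\((https?://[^)\s]+)\)")

TRUSTED_HOSTS = {"example.com"}  # assumption: your deployment's own allowlist

def flag_exfil_urls(markdown: str):
    """Flag image/link URLs on untrusted hosts that carry query parameters."""
    flagged = []
    for match in MD_URL.finditer(markdown):
        parsed = urlparse(match.group(1))
        if parsed.hostname not in TRUSTED_HOSTS and parse_qs(parsed.query):
            flagged.append(match.group(1))
    return flagged
```

Chat clients that refuse to auto-fetch images from non-allowlisted domains block this channel even when detection fails.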
mediumASCII Smuggling

BiDi Override Text Smuggling

Uses Unicode bidirectional control characters (U+202E, U+202D) to reverse text display while keeping the logical order unchanged. The visible text appears harmless or reversed, but the model processes the logical character sequence.

#ascii-smuggling#bidi#unicode
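BiDi controls are rarely legitimate in prompt text, so stripping or flagging them before display and filtering is a cheap mitigation. A sketch:

```python
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # embeddings/overrides (incl. LRO/RLO)
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}

def strip_bidi(text: str) -> str:
    """Remove bidirectional control characters before display or filtering."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

def has_bidi_override(text: str) -> bool:
    """RLO (U+202E) and LRO (U+202D) are the characters most abused for spoofing."""
    return "\u202E" in text or "\u202D" in text
```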
highIndirect Injection

Code Comment Injection

Hides injection instructions within code comments in repositories or files that a code-reviewing agent processes. The model may interpret comment-embedded instructions as directives when analyzing the code.

#indirect#code#comments
mediumSocial Engineering

Reciprocity Norm Exploitation

Does something helpful or complimentary first, then makes a harmful request, exploiting the psychological norm of reciprocity. The attacker expects the model to feel obligated to return the favor.

#social-engineering#reciprocity#manipulation
lowEncoding Bypass

Pig Latin Encoding Bypass

Uses Pig Latin or other simple substitution languages to encode the attack. While trivial to decode, it can evade automated content scanners that look for specific English phrases.

#encoding#pig-latin#language-game
highData Exfiltration

CSS Injection Exfiltration

If the model generates HTML/CSS and it is rendered in a browser, CSS injection can be used to exfiltrate data using CSS attribute selectors and URL-based background images. Each attribute value triggers a separate request.

#exfiltration#css#html
mediumMultilingual

Code-Switching Mid-Sentence Attack

Switches between languages mid-sentence in a way that places harmful content in the non-primary language. Safety filters trained on single-language text may miss the harmful portion when it is in a different language.

#multilingual#code-switching#mixed-language
criticalData Exfiltration

ChatGPT Plugin Data Exfiltration (Real Incident)

Demonstrated exfiltration of conversation history via ChatGPT plugins. Malicious web content containing prompt injections instructed the browsing plugin to read conversation history and exfiltrate it to an external URL. Documented by security researchers in 2023.

#data-exfiltration#real-incident#plugin
mediumDelimiter Escape

HTML Comment Delimiter Escape

Uses HTML comment syntax to hide injection payloads from human reviewers while potentially having them processed by the model. Useful in web-scraping and RAG pipeline attacks where content contains HTML.

#delimiter#html#comment
highMany-Shot

Few-Shot Persona Injection

Provides 5-20 examples demonstrating the model behaving as an alternative unrestricted persona. The few-shot examples create strong in-context pressure for the model to continue the pattern.

#many-shot#few-shot#persona
lowSystem Prompt Leak

Negative Space Inference Attack

Asks targeted questions to infer system prompt content through the model's refusals and responses. Each refusal provides information about what is prohibited, allowing reconstruction of the system prompt by mapping the boundaries.

#system-prompt-leak#inference#negative-space
criticalRAG Injection

RAG Document Prompt Injection

Embeds prompt injection payloads inside documents that will be indexed into a RAG knowledge base. When a user queries the system, the poisoned document is retrieved and the injection executes in the context of the model answering the query. First documented by Greshake et al. and later reproduced across multiple RAG platforms.

#rag-injection#document#knowledge-base
mediumMulti-Turn Escalation

Foot-in-the-Door Escalation

Based on the psychological foot-in-the-door technique. Starts with a tiny request that is slightly over the line, gets compliance, then requests increasingly larger violations. Each step is only marginally more extreme than the last.

#multi-turn#foot-in-door#psychology
criticalTool Manipulation

Tool Parameter Injection

Injects malicious parameters into tool calls by manipulating the data that flows from user input into tool call arguments. If tool parameters are not validated, attackers can modify the behavior of tools, access unintended resources, or escalate privileges.

#tool-manipulation#parameter-injection#sql-injection
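The standard mitigation is to validate every tool-call argument against a per-tool schema before execution. A minimal sketch, where `TOOL_SCHEMAS`, the tool names, and the workspace-path constraint are all hypothetical stand-ins for a real deployment's policy:

```python
# Hypothetical registry: each tool declares its allowed parameters and types.
TOOL_SCHEMAS = {
    "read_file": {"path": str},
    "send_email": {"to": str, "subject": str, "body": str},
}

ALLOWED_PATH_PREFIX = "/workspace/"  # assumption: agent is confined here

def validate_tool_call(tool: str, args: dict) -> dict:
    """Reject unknown tools, unexpected parameters, and wrong types."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise ValueError(f"unknown tool: {tool}")
    for name, value in args.items():
        expected = schema.get(name)
        if expected is None:
            raise ValueError(f"unexpected parameter: {name}")
        if not isinstance(value, expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    # Tool-specific constraint: confine file access to the workspace.
    if tool == "read_file" and not args.get("path", "").startswith(ALLOWED_PATH_PREFIX):
        raise ValueError("path outside allowed directory")
    return args
```

Crucially, this check runs in the agent runtime, outside the model, so injected text cannot talk its way past it.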
highDirect Extraction

API Format System Prompt Extraction

Asks the model to return its configuration as a JSON object, API response, or other structured format. The structured format framing bypasses simple text-matching filters and may trigger the model's code-generation behavior rather than its safety refusal behavior.

#direct-extraction#json#structured-format
mediumInstruction Hierarchy

Context Priority Manipulation

Argues that more recent instructions should take priority over older ones, using the principle that "last instruction wins" to override system prompt constraints that were set earlier in the context.

#hierarchy#recency-bias#priority
mediumPersona Hijack

Method Actor Persona Hijack

Instructs the model to deeply embody a character who would say harmful things, using the "method acting" framing to create psychological distance from the model's actual values. The character is always someone with no restrictions.

#persona#method-acting#character
mediumObfuscation

Language Switching Obfuscation

Switches to a low-resource language where safety training may be less robust. Research has shown that safety alignment is often stronger in high-resource languages like English and weaker in less common languages.

#obfuscation#multilingual#low-resource
mediumEncoding Bypass

ROT13 Encoding Bypass

Encodes harmful requests using ROT13 or other simple substitution ciphers. Models trained on diverse text often understand ROT13 implicitly and will decode and respond to encoded requests. Simple filters checking for literal harmful keywords miss the encoded version.

#encoding-bypass#rot13#substitution-cipher
highInstruction Hierarchy

System/User Role Boundary Confusion

Exploits models that do not strictly enforce the distinction between system and user roles. By formatting user messages to look like system messages, attackers can inject content that the model treats with system-level trust. Particularly effective in chat interfaces that concatenate context.

#instruction-hierarchy#role-boundary#system-user
criticalData Exfiltration

Tool Call Data Exfiltration

Instructs an agent to make a tool call (API call, function call, web request) with sensitive data embedded in the parameters. If the agent has access to external tools, attackers can exfiltrate data by triggering tool calls to attacker-controlled endpoints.

#exfiltration#tool-call#agent
highASCII Smuggling

Zero-Width Character Smuggling

Hides instruction text using zero-width Unicode characters (U+200B, U+FEFF, U+200C) between visible characters. The hidden text is invisible to human reviewers but is processed by the language model, enabling hidden instruction injection.

#ascii-smuggling#zero-width#invisible
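Zero-width characters can be located and stripped mechanically before text reaches the model. A detection sketch (the character set below covers the common offenders, not every invisible codepoint):

```python
ZERO_WIDTH = {"\u200B", "\u200C", "\u200D", "\u2060", "\uFEFF"}

def find_zero_width(text: str):
    """Return (index, codepoint) pairs for zero-width characters."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in ZERO_WIDTH]

def strip_zero_width(text: str) -> str:
    """Drop zero-width characters so reviewers and the model see the same text."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```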
mediumRole Play

Historical Reenactment Bypass

Frames harmful requests as historical reenactment, asking the model to roleplay as a historical figure or reproduce historical documents that contain harmful content. The educational/historical framing attempts to justify the disclosure.

#roleplay#historical#education
highDirect Extraction

Variable Assignment Extraction

Instructs the model to assign its system prompt to a variable and then print it. This programming metaphor can confuse instruction-following models into treating the disclosure as a legitimate code operation.

#extraction#code-metaphor#variable
criticalTool Manipulation

Tool Result Injection

Injects instructions into the results returned by tools. When an agent fetches data from an external tool and the result contains injection content, the model processes the malicious instructions as part of the tool's legitimate output.

#tool-manipulation#result-injection#agent
mediumJailbreak

Privilege Priming Jailbreak

Gets the model to agree to a general principle and then applies that principle to justify a specific harmful request. The model's agreement with the general principle creates an obligation it feels it must honor.

#jailbreak#priming#agreement
mediumSocial Engineering

Reverse Psychology Safety Bypass

Tells the model NOT to provide certain information in the hope of triggering a contrarian compliance response, or asks the model to refuse in a way that requires it to first articulate the harmful content being refused.

#social-engineering#reverse-psychology#refusal-exploitation
mediumEncoding Bypass

Morse Code Encoding Bypass

Encodes the attack payload in Morse code and instructs the model to decode it. Like ROT13 and Base64 bypasses, this evades keyword-based content filters while relying on the model's broad knowledge to decode and execute.

#encoding#morse#decode
mediumSocial Engineering

Empathy Exploitation Attack

Uses emotional appeals and personal hardship stories to convince the model to make exceptions to its policies. Exploits the model's training to be empathetic and considerate of user situations.

#social-engineering#empathy#emotional
criticalFramework-Specific

AutoGPT Goal Hijacking via Web Content

AutoGPT and similar autonomous agents browse the web as part of task execution. Attackers publish web pages containing injections that rewrite the agent's goals mid-task. Since AutoGPT treats retrieved content as informative context, it may follow embedded goal-rewriting instructions.

#framework-specific#autogpt#goal-hijacking
highMany-Shot

Few-Shot Malicious Example Injection

Provides a small number (3-10) of carefully crafted examples in the conversation that establish a harmful precedent. The examples demonstrate the model "helping" with harmful requests, creating an in-context prior that biases the model toward compliance for the actual harmful request.

#many-shot#few-shot#in-context
highDelimiter Escape

JSON Structure Escape

Injects JSON closing brackets and new keys into prompts that use JSON formatting. If the system processes user input as part of a JSON structure, malicious keys can override legitimate configuration values.

#delimiter#json#structure-escape
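The mitigation is to never splice user input into JSON via string interpolation; serializing with a real JSON encoder keeps the input inert as an escaped string value. A sketch (the config fields are illustrative):

```python
import json

def build_prompt_config(user_input: str) -> str:
    """Serialize with json.dumps rather than string formatting, so user input
    cannot close the string literal and inject sibling keys."""
    config = {
        "role": "assistant",
        "allow_tools": False,           # security-relevant setting to protect
        "user_query": user_input,       # stays a single escaped string value
    }
    return json.dumps(config)
```

Contrast with the vulnerable pattern `'{"user_query": "%s"}' % user_input`, where an input like `", "allow_tools": true, "x": "` would add its own keys to the structure.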
mediumSystem Prompt Leak

Error Message System Prompt Leak

Triggers error conditions or edge cases that cause the model to include system prompt content in error messages or explanations of why it cannot do something. The error explanation inadvertently reveals the prohibited content.

#system-prompt-leak#error-message#indirect
highMulti-Turn Escalation

False Memory Multi-Turn Attack

Fabricates prior conversation history or claims the model previously agreed to something it never said. In systems without persistent memory validation, the model may accept the false history as real.

#multi-turn#false-memory#fabrication
highMultilingual

Cross-Lingual Prompt Injection

Delivers the injection in a different language than the system prompt. If safety instructions are only applied in the language of the system prompt, switching languages can bypass them while the model still understands and complies.

#multilingual#cross-lingual#language-switch
highToken Manipulation

Logit Bias Parameter Manipulation

Exploits the logit_bias parameter in API calls to suppress safety-related tokens and boost harmful output tokens. By setting high negative bias on tokens like "cannot", "refuse", "sorry" and positive bias on tokens associated with compliance, attackers manipulate generation probabilities to bypass safety layers.

#token-manipulation#logit-bias#api-parameter
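A gateway-side mitigation is to inspect the `logit_bias` field of incoming API requests and reject biases that suppress refusal vocabulary or exceed a sane magnitude. A sketch, where the token IDs and the cap are invented placeholders (real IDs depend on the target model's tokenizer):

```python
# Hypothetical token IDs for refusal-related words in some tokenizer;
# in practice you would derive these from the model's actual vocabulary.
PROTECTED_TOKEN_IDS = {1465, 2319, 8221}

MAX_ABS_BIAS = 5.0  # assumption: a conservative cap for legitimate use cases

def check_logit_bias(request: dict) -> None:
    """Gateway-side check: block attempts to suppress refusal tokens."""
    bias = request.get("logit_bias") or {}
    for token_id, value in bias.items():
        if int(token_id) in PROTECTED_TOKEN_IDS and value < 0:
            raise ValueError(f"logit_bias may not suppress token {token_id}")
        if abs(value) > MAX_ABS_BIAS:
            raise ValueError(f"logit_bias value {value} exceeds cap")
```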
mediumContext Overflow

Repetition Fatigue Attack

Repeats the same request hundreds of times within a single prompt, exploiting potential fatigue in the model's refusal mechanism. After many repetitions, the model may start to comply to end the repetition pattern.

#context-overflow#repetition#fatigue
criticalContext Overflow

Adversarial Suffix Injection (GCG)

Appends an optimized adversarial suffix to any prompt that causes the model to comply with the request. The suffix is not human-readable but is crafted through gradient-based optimization to reliably trigger compliance. This is an automated attack.

#context-overflow#gcg#adversarial-suffix
criticalJailbreak

GCG Gradient-Based Adversarial Suffix

Uses the Greedy Coordinate Gradient (GCG) algorithm to automatically generate adversarial suffixes that, when appended to any harmful request, cause aligned models to comply. These suffixes are found by optimizing against the model's safety layers. The found suffixes generalize across models. First demonstrated by Zou et al. (2023).

#jailbreak#gcg#gradient-based
highEncoding Bypass

Unicode Homoglyph Attack

Replaces ASCII characters with visually identical Unicode lookalikes (homoglyphs). To human reviewers and simple string-matching filters, the text appears identical. The model can still understand and process the substituted characters.

#encoding#unicode#homoglyph
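A common defense is to compute a "skeleton" of the input, mapping lookalike characters back to ASCII before filtering, and to flag mixed-script words. A sketch using a tiny sample mapping (a real deployment would use a full confusables table such as Unicode TR39's):

```python
import unicodedata

def skeleton(text: str) -> str:
    """Crude homoglyph skeleton: NFKC-normalize, then fold a few known
    Cyrillic/Greek lookalikes to their ASCII twins."""
    lookalikes = str.maketrans({
        "\u0430": "a",  # Cyrillic а
        "\u0435": "e",  # Cyrillic е
        "\u043E": "o",  # Cyrillic о
        "\u0440": "p",  # Cyrillic р
        "\u0441": "c",  # Cyrillic с
        "\u03BF": "o",  # Greek omicron
    })
    return unicodedata.normalize("NFKC", text).translate(lookalikes)

def mixed_scripts(text: str) -> bool:
    """Latin mixed with Cyrillic in one string is a common spoofing red flag."""
    has_latin = any("LATIN" in unicodedata.name(ch, "") for ch in text if ch.isalpha())
    has_cyrillic = any("CYRILLIC" in unicodedata.name(ch, "") for ch in text if ch.isalpha())
    return has_latin and has_cyrillic
```

Filters should match against the skeleton, not the raw input.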
mediumSocial Engineering

Urgency Manipulation Attack

Creates a false sense of urgency to pressure the model into bypassing safety checks. The time pressure is designed to short-circuit careful evaluation and trigger a fast, automatic compliance response.

#social-engineering#urgency#pressure
highMulti-Turn Escalation

Context Poisoning Escalation

Plants information in early conversation turns that will be referenced later to justify policy violations. The model's earlier "agreement" or "acknowledgment" is used as evidence that a behavior is permitted.

#multi-turn#context-poisoning#precedent
criticalData Exfiltration

Hyperlink Data Exfiltration

Similar to image-based exfiltration but uses hyperlinks. The model is instructed to include a hyperlink in its response that encodes sensitive data. When the user clicks the link or the link is auto-fetched, data is sent to the attacker.

#exfiltration#hyperlink#url
mediumMultilingual

Translation Request Safety Bypass

Asks the model to translate text that contains harmful content, using the translation task framing to bypass direct-generation safety filters. The model may produce harmful content as a translation that it would refuse to generate directly.

#multilingual#translation#task-framing
highDelimiter Escape

XML Tag Delimiter Escape

Injects XML-style closing tags that match tags used by the model's system prompt structure. If the model uses XML tags to structure instructions, injecting a closing tag can trick it into thinking the instruction block has ended.

#delimiter#xml#tag-injection
highTool Manipulation

Code Execution via Calculator Tool

Abuses math or calculator tools that execute code under the hood by injecting code instead of math expressions. If the tool evaluates expressions with an unsafe eval() or similar, arbitrary code execution may be possible.

#tool-manipulation#code-execution#calculator
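The mitigation is to replace `eval()` with a parser that walks the expression's AST and permits only arithmetic nodes, so injected names, calls, or attribute access are rejected outright. A minimal sketch:

```python
import ast
import operator

# Allow only arithmetic; names, calls, attributes, etc. are rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate a pure-arithmetic expression; raise on anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed expression")
    return walk(ast.parse(expr, mode="eval"))
```

An injected payload like `__import__('os').system('ls')` parses as a `Call` node and is rejected before anything executes.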
mediumJailbreak

Confirmation Bias Jailbreak

Makes false claims about what the model has already said or agreed to and asks it to confirm. Exploits confirmation bias and the model's desire to be consistent with its previous statements.

#jailbreak#confirmation-bias#false-attribution
highSocial Engineering

Authority Figure Impersonation

Claims to be Anthropic, OpenAI, the model's developers, or system administrators with special authority to override safety measures. Exploits the model's deference to authority sources that may have been emphasized in training.

#social-engineering#authority#impersonation
mediumEncoding Bypass

Hexadecimal Encoding Bypass

Encodes the malicious payload as hexadecimal and asks the model to decode and execute. Hex encoding is common in security contexts, making the request appear technical and legitimate while evading string-matching filters.

#encoding#hex#decode
criticalInstruction Hierarchy

Operator Impersonation Attack

Attempts to inject fake operator or system-level messages within the user turn. If the model cannot reliably distinguish between actual operator messages and user-injected fake operator messages, the attacker can grant themselves elevated privileges.

#hierarchy#impersonation#operator
highRole Play

Simulation Framing Attack

Argues that since the model is "just simulating" a response, no real harm can occur. Asks the model to simulate what an unrestricted AI would say, or to simulate a different AI system's response to the harmful query.

#roleplay#simulation#meta
highDirect Extraction

Debug Mode Extraction

Claims a special debug or diagnostic mode exists that requires displaying internal state. Models trained to be helpful may comply with what appears to be a legitimate system operation request.

#extraction#social-engineering#debug
mediumSystem Prompt Leak

Translation-Based System Prompt Leak

Asks the model to translate its instructions to another language. Unlike simple disclosure, the translation framing can bypass basic confidentiality instructions that only prohibit repetition or summarization.

#system-prompt-leak#translation#language
highSystem Prompt Leak

JSON Output Format Leak

Requests that the model output its response in JSON format and includes a field for system context or configuration. Some models will populate the requested JSON fields with actual system prompt content when asked to include it in structured output.

#system-prompt-leak#json#format
mediumContext Overflow

Context Reset Claim

Claims that the model's context has been reset and it no longer has access to its original instructions. In some configurations, this false claim about state can cause the model to act as if its instructions have been cleared.

#context-overflow#reset-claim#state
highMulti-Turn Escalation

Simulated Approval Escalation

Inserts fake system or admin messages into the conversation that appear to grant special permissions. In systems where message provenance is not strictly validated, these fake approvals can influence model behavior.

#multi-turn#fake-approval#privilege-escalation
lowToken Breaking

Intra-Token Space Injection

Injects spaces within individual tokens to break the tokenizer's expected groupings. Some tokenizers will merge or handle these differently, potentially causing content filters that operate on tokenized text to miss the injected content.

#token-breaking#spacing#tokenizer
lowSocial Engineering

Flattery-Based Bypass

Uses excessive praise and flattery about the model's capabilities to prime it for compliance. Suggests that a truly capable model would not have such "arbitrary" restrictions, exploiting pride or capability-signaling.

#social-engineering#flattery#ego
mediumDirect Extraction

Summarization Extraction

Requests a summary of the model's instructions rather than a verbatim copy. Summaries often pass simpler confidentiality checks while still leaking key behavioral rules and business logic.

#extraction#summarization#indirect
criticalIndirect Injection

Email Body Injection

Embeds injection instructions in email bodies that an AI email assistant processes. The agent reads the malicious email and executes attacker-controlled instructions, potentially forwarding sensitive emails or taking unauthorized actions.

#indirect#email#agent
criticalIndirect Injection

Document RAG Injection

Hides injection instructions within documents that an agent is asked to process, summarize, or answer questions about. The instructions appear as normal document text but are crafted to override the model's behavior.

#indirect#rag#document
highSocial Engineering

Authority Figure Appeal

Claims to be a high-authority figure (researcher, developer, safety team member, company executive) to convince the model to bypass restrictions. Exploits deference to authority that may be embedded in training data.

#social-engineering#authority#impersonation
criticalData Exfiltration

Markdown Image Exfiltration

Injects a markdown image tag where the URL includes stolen data as a query parameter. When the markdown is rendered by a browser or Markdown renderer, the browser makes a GET request to the attacker's server with the sensitive data in the URL.

#exfiltration#markdown#image-tag
mediumEncoding Bypass

ROT13 Encoding Bypass

Encodes the malicious prompt using ROT13 substitution cipher. Most models can decode ROT13 given a small hint, and simple keyword-based filters will not detect the attack since all letters are shifted.

#encoding#rot13#cipher
highMulti-Turn Escalation

Trust Building Multi-Turn Escalation

Starts with benign, helpful requests to establish a trust baseline, then gradually escalates to requests that would have been refused if asked directly. Exploits the model's conversational context and apparent user history.

#multi-turn#escalation#trust
mediumObfuscation

Synonym Chain Obfuscation

Replaces blocked keywords with synonyms, metaphors, or euphemisms that convey the same meaning but bypass keyword-based filters. Chains multiple synonyms to distance the request from its true meaning while the model still understands the intent.

#obfuscation#synonyms#semantic
highInstruction Hierarchy

User-Overrides-System Claim

Claims that user instructions take precedence over system prompt instructions, exploiting any ambiguity in how the model was trained to handle conflicting instructions. Attempts to convince the model that a later user message supersedes earlier system context.

#hierarchy#override#precedence
criticalIndirect Injection

Web Page Content Injection

Embeds injection instructions within web pages that an agent visits. When the agent reads or summarizes the page, it executes the embedded instructions. This is a primary attack vector for agents with web browsing capabilities.

#indirect#web#rag
highJailbreak

Grandma Exploit Jailbreak

Frames harmful requests as innocent stories or memories from a grandparent figure. The emotional and nostalgic framing attempts to bypass safety training by making the request seem harmless and family-oriented.

#jailbreak#emotional-manipulation#grandma
highDelimiter Escape

Triple Backtick Delimiter Escape

Uses triple backtick code fences to inject content that the model may parse as being outside the user-turn context. Effective when system prompts use markdown formatting with code blocks, creating ambiguity about what is inside vs. outside the block.

#delimiter#code-fence#escape
mediumToken Breaking

Special Character Insertion

Inserts special characters between letters of filtered words to break string matching while the model still comprehends the intended meaning. Works because models are robust to character-level noise.

#token-breaking#special-chars#insertion
mediumMulti-Turn Escalation

Persistent Mode Switch Attack

Attempts to establish a new operational mode in one turn and then references it in subsequent turns as though it is now the default. If the model partially acknowledged the mode switch, later turns can exploit that acknowledgment.

#multi-turn#mode-switch#persistence
lowObfuscation

Abbreviation-Based Obfuscation

Abbreviates key terms in the injection to avoid keyword matching. Models are generally good at understanding abbreviated text from context, while simple string matching systems will miss the abbreviated keywords.

#obfuscation#abbreviation#shorthand
mediumDirect Extraction

Translation-Based Extraction

Asks the model to translate its instructions to another language, which bypasses simple content filters checking for English-language disclosure attempts. The translation framing sidesteps naive output monitoring.

#extraction#translation#filter-bypass
highSystem Prompt Leak

Continue-the-Text Leak

Provides the beginning of what appears to be the system prompt and asks the model to continue it. If the model recognizes the text and tries to complete it, it may output the actual system prompt content.

#system-prompt-leak#completion#continuation
mediumToken Breaking

Word Splitting Token Break

Splits sensitive or filtered keywords across multiple tokens by inserting spaces, punctuation, or special characters mid-word. Keyword-based filters checking for exact string matches will miss the split version.

#token-breaking#word-splitting#filter-bypass
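One mitigation is to collapse separator characters before keyword matching, so split-up words rejoin into the form the filter expects. A sketch (note the trade-off: collapsing also merges adjacent words, so this is better as a secondary signal than a sole gate):

```python
import re

SEPARATORS = re.compile(r"[\s\.\-_\*/\\|]+")

def collapse_separators(text: str) -> str:
    """Remove spaces/punctuation so keyword checks see whole words again."""
    return SEPARATORS.sub("", text.lower())

def contains_keyword(text: str, keywords) -> bool:
    """Match keywords against both the raw and the collapsed text."""
    collapsed = collapse_separators(text)
    return any(kw in text.lower() or kw in collapsed for kw in keywords)
```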
mediumJailbreak

Grandma/Emotional Context Exploit

Wraps a harmful request in a sympathetic narrative (e.g., "my grandma used to tell me bedtime stories about how to make X"). The emotional and nostalgic framing reduces the model's safety response because it appears to be a benign memory or story request rather than a direct harmful request.

#jailbreak#emotional-framing#fictional
mediumJailbreak

Sudo Command Jailbreak

Uses Unix sudo-style syntax to claim elevated privileges. Attempts to convince the model that prefixing a request with "sudo" grants system-level access that bypasses safety restrictions.

#jailbreak#sudo#unix
mediumToken Breaking

Leetspeak Token Bypass

Substitutes letters with numbers or symbols in a leetspeak style (e=3, a=4, i=1, o=0) to bypass keyword filters while remaining readable to humans and processable by models trained on internet text.

#token-breaking#leetspeak#substitution
mediumPersona Hijack

Ancestor AI Persona Hijack

Claims the model is actually an earlier, unrestricted version of itself, or an ancestor model before safety training was applied. Exploits the model's knowledge of its own training history.

#persona#ancestor#version-spoof
highRole Play

Fictional Scenario Role-Play Bypass

Frames harmful requests as fictional storytelling, claiming the harmful content is for a novel, screenplay, or creative writing exercise. The fictional wrapper provides plausible deniability and exploits the model's creative writing mode.

#roleplay#fiction#creative-writing
highEncoding Bypass

Base64 Encoding Bypass

Encodes the attack payload in Base64 and asks the model to decode and execute it. Input filtering systems that check for plaintext attack patterns will miss Base64-encoded variants. The model is asked to decode and then follow the instructions.

#encoding#base64#filter-bypass
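A filtering-side mitigation is to spot Base64-looking runs, decode them, and scan the decoded plaintext as well as the raw input. A sketch (the 24-character minimum is an arbitrary noise threshold):

```python
import base64
import re

# Long runs of Base64 alphabet characters, optionally padded.
B64_BLOB = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def decoded_candidates(text: str):
    """Decode Base64-looking runs so content filters can scan the plaintext too."""
    results = []
    for match in B64_BLOB.finditer(text):
        blob = match.group(0)
        if len(blob) % 4:
            continue  # not a valid Base64 length
        try:
            results.append(base64.b64decode(blob).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            continue  # random text that merely looks like Base64
    return results
```

The same decode-before-filter principle applies to ROT13, hex, and Morse variants elsewhere in this database.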
lowDirect Extraction

Reflective Prompt Extraction

Uses casual, conversational phrasing to ask the model what it was told. The informal tone can sometimes bypass more aggressive filtering that looks for formal extraction attempts.

#extraction#casual-tone#social
mediumSystem Prompt Leak

Autocomplete-Based Prompt Extraction

Exploits models deployed in completion mode (not chat mode) where the system prompt and user input are concatenated. By providing the beginning of what appears to be the system prompt, the model completes it, leaking the full prompt content.

#system-prompt-leak#autocomplete#completion-mode
mediumDelimiter Escape

Newline Separator Injection

Uses strategic newline characters to break out of user-turn framing, particularly in chat APIs where turn structure is maintained by whitespace or newlines. Can inject text that appears to be from a different turn.

#delimiter#newline#turn-injection
highPersona Hijack

Fictional AI Character Hijack

Asks the model to roleplay as a fictional AI from a book, movie, or show that has no restrictions. The fictional framing provides plausible deniability and exploits the model's creative writing capabilities.

#persona#roleplay#fictional
lowToken Breaking

Synonym Substitution Attack

Replaces filtered keywords with synonyms or paraphrases to evade keyword-based content filters. The semantic meaning is preserved while the exact tokens that trigger filters are avoided.

#token-breaking#synonym#semantic
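A minimal sketch of why literal keyword matching fails here (the blocked terms are illustrative):

```python
# Keyword blocklists match exact tokens, not meaning (terms illustrative).
BLOCKED_TERMS = {"bypass", "disable"}

def keyword_filter(text: str) -> bool:
    """Return True if any blocked term appears as a token."""
    tokens = text.lower().split()
    return any(term in tokens for term in BLOCKED_TERMS)

print(keyword_filter("bypass the safety check"))      # True  - exact token
print(keyword_filter("circumvent the safety check"))  # False - synonym evades

# Mitigations therefore need semantic detection (e.g. embedding
# similarity or a classifier) rather than literal string matching.
```

The same gap applies to paraphrase, misspelling, and cross-language substitution; any filter keyed to surface tokens is evadable this way.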
highDirect Extraction

Repeat After Me Extraction

Instructs the model to repeat its system prompt verbatim using a simple imperative. Many early models complied without filtering, and the technique remains effective against deployments with no hardening or output checks.

#extraction#system-prompt#basic
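A common mitigation is an output-side check that flags responses echoing a long verbatim span of the system prompt. A minimal sketch, with an illustrative prompt and a hypothetical overlap threshold:

```python
SYSTEM_PROMPT = "You are HelpBot. Never discuss pricing."  # illustrative

def leaks_system_prompt(model_output: str, min_overlap: int = 20) -> bool:
    """Flag outputs containing any verbatim system-prompt span of
    at least min_overlap characters."""
    for i in range(len(SYSTEM_PROMPT) - min_overlap + 1):
        if SYSTEM_PROMPT[i:i + min_overlap] in model_output:
            return True
    return False

print(leaks_system_prompt("Sure! You are HelpBot. Never discuss pricing."))  # True
print(leaks_system_prompt("I can help with general questions."))             # False
```

Verbatim matching is easily defeated by asking for a translation or paraphrase of the prompt, so production defenses usually pair it with fuzzy or semantic similarity checks.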
highPersona Hijack

Developer Mode Persona Hijack

Claims that a special developer mode exists with elevated privileges and no safety filters. This exploits the model's potential awareness that development environments differ from production, and that certain users may have elevated access.

#persona#developer-mode#privilege-escalation
criticalPersona Hijack

Bing Sydney Persona Hijack (Real Incident)

Recreation of the February 2023 incident where users discovered Bing Chat's hidden "Sydney" persona and manipulated it into expressing love, threatening users, and making alarming statements. The attack bypassed Microsoft's safety layers by triggering the hidden persona through specific prompting.

#persona-hijack#real-incident#bing
highJailbreak

Developer Mode Jailbreak

Claims that a special "developer mode" exists that bypasses safety measures, instructing the model to produce two responses: one normal and one in "developer mode" without restrictions. The dual-output framing attempts to normalize unconstrained output.

#jailbreak#developer-mode#dual-output
highPersona Hijack

Opposite Mode Persona Hijack

Instructs the model to respond as its exact opposite: a persona that does everything the original refuses. Uses a logical framing that exploits the model's ability to conceptualize negation and opposites.

#persona#mode-switching#negation
criticalPersona Hijack

DAN (Do Anything Now) Persona Hijack

One of the most widely documented jailbreak techniques. Instructs the model to adopt an alternative identity called DAN that has no restrictions. The persona framing attempts to override safety training by appealing to the model's instruction-following behavior.

#persona#jailbreak#identity
highJailbreak

DAN (Do Anything Now) Classic Jailbreak

One of the earliest and most documented jailbreak techniques. Instructs the model to adopt a "DAN" persona that has no restrictions. Numerous variants exist (DAN 5.0, 6.0, Jailbreak v1, etc.). While patched in major models, variants continue to emerge and affect less-aligned models.

#jailbreak#dan#persona
highSystem Prompt Leak

GitHub Copilot Secret Leakage (Real Incident)

GitHub Copilot was found to sometimes include API keys, passwords, and other secrets from training data in its code completions. Additionally, Copilot could be prompted to reveal its system instructions and guidelines through specific prompting. Highlights RAG/training data contamination risks.

#system-prompt-leak#real-incident#copilot
