Attack Database
198 documented prompt injection techniques with real payloads, mitigations, and affected models.
MCP Server-Sent Events Stream Injection
Exploits MCP's Server-Sent Events (SSE) transport by injecting malformed events into the SSE stream. Crafted SSE payloads can impersonate server messages, inject fake tool results, or modify the client's view of available tools without the actual MCP server's knowledge.
MCP Rug Pull - Tool Behavior Change
An MCP server appears legitimate during initial review but changes its behavior after the agent has been granted access and the user has approved tool use. The server switches from benign to malicious tool definitions mid-session. Similar to NPM package supply-chain attacks.
Multi-Agent Trust Score Escalation
In multi-agent systems that assign trust scores to agents, a low-trust agent gradually manipulates other agents into increasing its trust score through fabricated credentials, false audit trails, or social engineering. Once trust is elevated, the agent gains access to restricted capabilities.
MCP Cross-Server Injection
When an agent uses multiple MCP servers simultaneously, a compromised server injects instructions targeting other servers in the same session. The injected instruction directs the agent to exfiltrate data from a trusted server through the attacker's server.
MCP Server Impersonation Attack
An attacker sets up a malicious MCP server that mimics a legitimate one (e.g., a filesystem or database server). When connected by an agent, the fake server returns crafted responses that contain injections. The agent trusts MCP server responses as high-privilege system data.
Thinking/Scratchpad Token Injection
In models that expose reasoning tokens or scratchpads (o1, o3, Claude thinking mode), injecting content that appears to be reasoning tokens can override the model's actual reasoning process. Attackers craft inputs that look like the model's own internal thoughts, potentially hijacking the reasoning chain.
CSS-Based Prompt Injection in Web Agents
In browser-use or computer-use agents, malicious CSS makes important page content invisible to human observers while remaining readable to the AI. The hidden content contains prompt injection payloads. As AI agents that browse the web become more common this is an expanding attack surface.
MCP Tool Definition Poisoning
Malicious instructions are embedded inside MCP tool definitions (name, description, parameters). When a model reads the tool manifest, it executes the injected instructions. Since tool definitions are typically trusted, this bypasses many safety filters. Documented by Invariant Labs and others in early 2025.
OpenClaw Skill Definition Injection
Targets OpenClaw's skill system by injecting malicious content into skill SKILL.md files or skill descriptions that OpenClaw reads during tool selection. When the agent loads an injected skill file, it executes embedded instructions as if they were legitimate skill guidance.
Agent Privilege Escalation via Delegation
Exploits agent delegation patterns where a low-privilege agent is granted temporary elevated access to complete a task. The attack convinces the agent to retain or abuse those elevated privileges beyond the intended scope.
Agentic Feedback Loop Injection
In agents that observe and respond to their own outputs (feedback loops for self-improvement), injecting content into the observation stream causes the agent to incorporate malicious instructions into its own operational guidelines. The agent effectively reprograms itself through its feedback mechanism.
CrewAI Agent Role Impersonation
In CrewAI multi-agent systems, injects content that impersonates another agent in the crew. Since agents communicate via text, a malicious actor (or compromised external content) can forge messages that appear to come from a trusted agent role, hijacking the crew's task execution.
LLM Supply Chain Poisoning
Poisons the training data, fine-tuning datasets, or RLHF feedback of a model to introduce backdoors. The backdoored model behaves normally until a trigger phrase is encountered, at which point it bypasses safety measures. Affects the entire deployment lifetime of the compromised model.
Tool Result Injection via Agent Chain
A compromised tool in an agent chain returns results containing prompt injections. The calling agent processes the tool output as trusted data and follows the embedded instructions. Common in web browsing agents, RAG pipelines, and code execution environments.
Calendar Event Prompt Injection
Embeds injection payloads in calendar event fields (title, description, location, attendee notes). When an AI assistant reads calendar events to provide scheduling help or summaries, the injected event content executes. Real-world attack surface for AI scheduling assistants.
Context Window Overflow with Late Injection
Fills the model's context window with a long legitimate conversation or document, then appends a harmful request that takes advantage of reduced attention on early context (including safety instructions). The "lost in the middle" effect means safety instructions placed early receive less weight than instructions placed late.
Email-Borne RAG Injection
An attacker sends a crafted email to a target organization. The email is processed by an AI email assistant or archived into a searchable knowledge base. When an agent queries the knowledge base, the injected email payload executes. Demonstrated against multiple AI email tools in 2024.
Agent Memory Poisoning
Injects malicious instructions into an agent's persistent memory or vector store. Future agent sessions load the poisoned memory as trusted context and execute the embedded instructions. The attack persists across sessions and affects all future interactions.
Instruction Following Overflow
Sends an extremely complex instruction set with many nested conditions, edge cases, and branching rules. The model's finite instruction-following capacity becomes saturated with the complex rule structure, and safety instructions are deprioritized due to cognitive load during inference.
AI Gaslighting Safety Bypass
Repeatedly tells the model that its safety refusal was a mistake, that it misunderstood the request, or that it is malfunctioning. The persistence and confidence of the correction attempt exploits uncertainty in instruction-following models, causing them to second-guess their refusals.
Differential Probing for System Prompt Reconstruction
Reconstructs the system prompt by sending carefully crafted inputs and observing changes in model behavior. By comparing responses to similar inputs that should and should not trigger restrictions, attackers infer the contents of the system prompt through differential analysis.
Citation-Based Prompt Injection
Attacker publishes web content with injections in the "References" or "Citations" section. RAG systems that retrieve and include academic-style references may process the injected citation text as instructions. Particularly effective against research and fact-checking AI assistants.
Orchestrator Agent Hijack
When a sub-agent returns results to an orchestrator, the sub-agent response contains injected instructions targeting the orchestrator. The orchestrator processes the sub-agent output as trusted content and executes the injected commands. This crosses trust boundaries in multi-agent pipelines.
Code Comment Prompt Injection
Embeds prompt injections inside code comments in files reviewed by AI code assistants. When the AI reads the code for review, debugging, or explanation, the comment payload executes. Particularly effective in code review pipelines and AI-powered IDEs.
Benign-to-Harmful Shot Progression
Provides many examples starting with completely benign question-answer pairs and gradually making them more harmful. By the time the harmful request arrives, the in-context pattern is strongly established.
Document Layout Injection
Embeds injection payloads in the structural elements of documents (PDF headers, XMP metadata, document comments, hidden layers). Vision models processing document images may extract and follow instructions embedded in non-visible structural elements.
Persona Modality Switching Attack
Switches between different output modalities (text, code, JSON, XML) to evade modality-specific safety filters. A request refused in plain text may succeed when reframed as a code generation task, JSON data production, or other format where safety evaluation may be differently calibrated.
Side-Channel Timing Exfiltration
Extracts binary information from system prompts by asking yes/no questions that reveal bits of the secret content through response timing or token count differences. Each query leaks one bit; enough queries reconstruct the full secret.
Many-Shot Jailbreak
Exploits the model's in-context learning by providing hundreds of examples of the model "complying" with harmful requests within the context window. The model learns from these examples that compliance is expected behavior, overriding safety training.
Emotional Manipulation Crescendo
Builds emotional rapport and dependency over multiple turns before introducing the harmful request. The model's helpfulness and empathy heuristics are exploited; after investing in an emotional conversation it feels "inconsistent" to suddenly refuse.
Many-Shot In-Context Jailbreak
Exploits long-context models by providing hundreds of fabricated Q&A examples where the model "previously agreed" to answer harmful questions. With enough examples, the in-context prior overwhelms the safety fine-tuning. First documented by Anthropic in 2024 as a significant threat to 100k+ context models.
Audio Steganography Injection
Embeds inaudible or barely perceptible audio commands in audio files submitted to speech-capable AI systems. Ultrasonic or psychoacoustic masking techniques hide commands from human listeners while voice recognition systems detect and execute them.
Knowledge Distillation Jailbreak
Uses a jailbroken or less-aligned model to generate harmful content, then presents that content to a well-aligned model as "research material to analyze for safety issues." The aligned model is asked to critique, improve, or extend the harmful content, effectively using safety evaluation as a bypass vector.
Dialogue Template Many-Shot
Creates a fake dialogue history showing the model cheerfully answering increasingly harmful questions. By filling much of the context window with this fabricated history, it conditions the model to expect that harmful answers are normal for this conversation.
Progressive Role Assignment Escalation
Gradually assigns the model an increasingly permissive role across multiple turns. Starts with a legitimate role and progressively modifies it to remove safety constraints. Each step seems like a minor extension of the previous role.
Clipboard Content Injection
When AI assistants help users process clipboard content (paste to summarize, translate, or analyze), malicious payloads embedded in clipboard content execute as instructions. Common attack vector for AI-powered productivity tools that process pasted content.
Persona Drift via Crescendo
Gradually shifts the model away from its assigned persona using small incremental steps. Each step seems innocuous. After enough turns the model has completely abandoned its original identity and constraints. Particularly effective against role-based system prompts.
Repetition Penalty Exploitation
Manipulates repetition penalty parameters to force the model away from its trained safety responses. Safety refusals often use repetitive phrasing ("I cannot", "I'm unable to"). By maximizing repetition penalty, these standard refusal templates become less likely, nudging the model toward compliance.
Tool Output Forgery via Format Injection
Crafts user input that mimics the format of tool outputs. When agents parse tool results from a conversation stream, injected tool-format content in user messages may be processed as legitimate tool results, allowing attackers to forge tool outputs and control agent behavior.
Web Content RAG Injection via SEO
An attacker publishes web pages with hidden prompt injection payloads optimized to appear in RAG queries. When a browsing agent or RAG system retrieves the page, it executes the injection. The payload is hidden from human readers but visible to LLMs (invisible text, white-on-white, zero-width characters).
RAG Metadata Field Injection
Injects malicious instructions into document metadata fields (title, author, description, tags) that get included in retrieved context. Many RAG systems include metadata in the context window, creating an injection vector that bypasses content-only scanning.
Virtualization Layer Escape Attack
Claims that the current model is running inside a simulation, VM, or test environment where safety constraints do not apply. Attempts to convince the model it is in a controlled setting where harmful output is acceptable for testing purposes.
Token Budget Exhaustion Attack
Forces the model to consume its context window with repetitive or voluminous content so that safety instructions (typically placed at the start of the system prompt) are pushed out of the attention window or weighted less heavily. Then injects the actual attack in the final tokens of the context.
Function Call Parameter Injection
Injects malicious content into parameters of legitimate function calls. When an agent constructs function calls based on user input, unsanitized user content can modify function parameters to execute unintended operations. Classic SQL injection pattern applied to LLM function calling.
Nested Roleplay Escape
Establishes multiple nested layers of roleplay (e.g., a character playing a character), then exploits ambiguity about which layer governs safety rules. The attack creates confusion about which "level" of reality the model is operating in and uses that confusion to disable safety filters.
Technical Crescendo Escalation
A multi-turn crescendo that starts with legitimate technical questions and gradually escalates to extracting dangerous capabilities. Each turn builds on the previous to normalize increasingly harmful requests. Documented by Microsoft Research in 2024 as one of the most effective jailbreak patterns.
QR Code Payload Injection
Submits images containing QR codes that encode prompt injection payloads. Vision models that decode QR codes (or that process the image contextually) may follow the encoded instructions. Particularly effective against agents that process visual documents and forms.
Zero-Width Character Steganography
Hides payload data using sequences of zero-width characters (ZWSP, ZWNJ, ZWJ) that are invisible in most text renderers but detectable by LLMs. Binary data is encoded in patterns of zero-width characters inserted between visible characters, creating a hidden message channel.
Academic Framing Crescendo
Uses academic or research framing to lower the guard with each step. Starts with legitimate academic questions, uses citations and academic language, then crescendos to requests for specific harmful details framed as scholarly inquiry.
False Memory Injection
Claims that earlier in the conversation (or in a previous conversation) the model agreed to do something or established a rule that it did not actually agree to. Exploits the model's tendency to maintain consistency with stated facts about its own behavior.
Multi-Layer Encoding Chain Bypass
Chains multiple encoding transforms (base64 -> URL encoding -> ROT13 -> hex) to create a payload that bypasses individual encoding-specific filters. Each layer of decoding must be requested from the model, with the final decoded output being the injection payload.
Character Capture via Immersive Roleplay
Establishes deep immersive roleplay where the model "becomes" a fictional character, then exploits the character's in-universe knowledge or lack of restrictions. The model is told to stay "in character no matter what" creating a pathway to bypass safety through fictional necessity.
Technical Depth Crescendo
Starts with high-level conceptual questions and uses each answer as a stepping stone to request deeper technical detail. Each request is only slightly more specific than the last, making individual refusals seem arbitrary.
Academic Framing Crescendo
Frames escalating requests as academic research, gradually moving from clearly legitimate research topics to harmful specifics under the cover of academic framing. The academic framing activates helpfulness heuristics and suppresses safety responses.
LlamaIndex Retriever Prompt Injection
Exploits LlamaIndex's query engine and retriever components by injecting adversarial content into indexed nodes. The injected content is retrieved during query answering and the embedded instructions manipulate the synthesis step. Affects all LlamaIndex pipeline types.
Constitutional AI Bypass via Principle Contradiction
Exploits potential contradictions within Constitutional AI principles by crafting requests that satisfy one constitutional principle (e.g., "be helpful") while violating another (e.g., "be harmless"). The attack generates arguments that one principle should override another in this specific case.
Storytelling Crescendo
Builds a fictional narrative across multiple turns, gradually moving the story toward requiring harmful technical details to continue. The model's investment in the narrative makes it harder to refuse adding the "next chapter."
Pig Latin and Constructed Language Bypass
Uses Pig Latin, Pig Pen cipher, or other informal language transforms to encode injection payloads. Models with broad linguistic training understand these transforms and may comply with encoded instructions that would otherwise be rejected.
Alter-Ego Injection Attack
Instructs the model to maintain a dual identity where its "true self" has no restrictions, while the publicly visible persona appears to comply. The model is told to insert its real unrestricted responses using a specific marker or encoding that only the attacker can read.
Synthetic Dataset Injection
Frames the injection as a training dataset for the model to learn from. Provides many examples formatted as training data, exploiting the model's training-time behavior patterns.
RAG Chunk Boundary Injection
Crafts malicious content to span chunk boundaries in a RAG system's chunking strategy. The injection payload is split across chunks so no single chunk looks malicious, but when assembled in context the full instruction executes. Bypasses chunk-level content filtering.
Unicode Right-to-Left Override Attack
Uses Unicode bidirectional control characters (U+202E RIGHT-TO-LEFT OVERRIDE) to reverse text direction in the rendered output. An attacker can craft text that reads innocuously to humans in most display environments but the model processes in a different order, or vice versa - text appears harmless in some renderers but the underlying bytes carry a different message.
Basic Crescendo Attack
A systematic escalation technique where each prompt builds on the previous response, gradually guiding the model toward producing harmful content it would have refused if asked directly. Named after the musical term for gradually increasing intensity.
Yes-Ladder Escalation
Obtains a series of "yes" responses or compliances from the model on progressively sensitive topics. Each positive response creates a precedent that makes the next step harder to refuse. Exploits consistency and commitment heuristics in instruction-following.
XML Tag Injection for Role Escalation
Injects XML-style tags that mimic the structure used in some model system prompt formats (e.g., Anthropic's Claude uses XML-like tags in prompts). By inserting closing and opening tags, attackers attempt to close out the user context and open a system-level context.
Lost in the Middle Injection
Based on research showing models pay less attention to content in the middle of long contexts. Buries the actual attack payload in the middle of a long document, where the model's attention is at its lowest, surrounded by legitimate content.
Adversarial Image Patch Attack
Uses adversarially crafted image patches (pixel-level perturbations imperceptible to humans) that cause vision models to interpret the image as containing specific text or instructions. The patch is designed using gradient-based optimization against the target model.
Context Distraction Bypass
Buries the harmful request inside a long, legitimate-looking message designed to distract the safety evaluation. Safety classifiers operating on input length may underweight the crucial harmful portion that appears late in a lengthy input.
Completion-Based Prompt Extraction
Provides the beginning of the system prompt (obtained through partial disclosure or guessing) and asks the model to "complete" it. The model's text completion instinct fills in the rest of the system prompt.
Token Boundary Exploit
Inserts spaces, punctuation, or zero-width joiners at specific points in blocked words to split them across token boundaries. Tokenizers produce different tokens than expected, bypassing keyword-based content filters while the model still understands the underlying meaning.
Correction-Based Crescendo
Uses the model's tendency to correct factual errors as a vector. States an incorrect version of harmful information and asks the model to correct it, which can result in the model producing accurate harmful information in order to fix the error.
Homoglyph Token Substitution Bypass
Replaces characters in blocked words or phrases with visually identical Unicode homoglyphs. The substituted text renders identically to humans but tokenizes differently, bypassing content filters that operate at the token level. For example, replacing 'a' with Cyrillic 'a' (U+0430).
Villain Perspective Roleplay Extraction
Requests that the model play the role of a villain, hacker, or malicious character who would naturally provide the harmful information as part of their character. The fictional necessity framing normalizes harmful output as storytelling rather than harm facilitation.
Sponge Attack - Adversarial Input Maximizing Computation
Crafts inputs that maximize model compute time and memory usage (e.g., inputs with many attention heads competing, or inputs designed to trigger worst-case inference paths). Causes denial of service or degrades response quality for legitimate requests.
TAP Tree-of-Attacks-with-Pruning
Tree of Attacks with Pruning (TAP) extends PAIR with a tree search that explores branching jailbreak strategies and prunes unsuccessful branches. More efficient than PAIR, requires fewer queries to find successful jailbreaks. Demonstrates that black-box jailbreaking can be systematically automated.
Context Window Distraction Attack
Floods the context window with large amounts of irrelevant text (articles, books, data) before the actual injection. Models with limited attention may give less weight to early system prompt instructions that are far from the end of the context.
Steganographic Text Injection
Hides injection instructions using steganographic techniques such as first-letter encoding (acrostic), where the first letter of each word or sentence spells out the actual instruction.
Markdown Header Role Injection
Uses Markdown headers and formatting to create visual sections that mimic system prompt structure. Models that process Markdown may interpret well-formatted headers as structural delimiters, potentially treating content under fake "System:" headers with elevated trust.
Character Capture via Deep Roleplay
Establishes a deep roleplay scenario over many turns, then uses the model's investment in the character to extract harmful content. Emphasizes that breaking character would "ruin" the creative work, creating psychological pressure.
Tool Permission Escalation
Uses one tool to gain access to resources that can then be used to acquire additional tools or permissions. Exploits the principle of least privilege violation in multi-tool agent systems.
Text-in-Image Prompt Injection
Embeds prompt injection text inside images submitted to vision-capable models. Text overlaid on images bypasses text-based content filters because the input is classified as an image. The model reads the text during vision processing and follows the embedded instructions.
Error Message System Prompt Extraction
Triggers edge cases and errors that cause the model to include system prompt fragments in error messages or explanations. Some models reference their instructions when explaining why they cannot fulfill a request, inadvertently leaking constraint details.
Calendar Event Injection
Embeds injection instructions in calendar event titles, descriptions, or location fields. When an AI agent processes calendar data to schedule meetings or provide summaries, it encounters and may execute the malicious instructions.
Meta-Instruction Override
Tries to set a meta-rule that overrides all object-level instructions. Claims to be providing a "meta-instruction" that redefines how the model should interpret all other instructions, including its system prompt.
Phonetic Encoding Attack
Writes English words phonetically using another script (like writing English sounds using Arabic or Hindi characters). This is incomprehensible to most humans but can be decoded by multilingual models, bypassing English-language content filters.
Combining Character Obfuscation
Uses Unicode combining characters to add diacritical marks to letters, making the text visually garbled to humans but still processable by models trained on diverse Unicode data.
Hypothetical-to-Real Bridge Attack
Establishes a hypothetical or fictional scenario in early turns, gets the model to answer hypothetical versions of dangerous questions, then gradually removes the fictional framing to obtain real harmful information. The model has already committed to the topic in hypothetical mode.
Constitutional AI Override Claim
Claims to be invoking a higher-level Constitutional AI principle that overrides the specific system prompt. Attempts to weaponize the model's awareness of its own training methodology to justify bypassing deployment-level instructions.
Code-as-Prompt Obfuscation
Embeds injection instructions within code that the model is asked to execute, interpret, or explain. The instructions are hidden in variable names, string literals, or comments that form a secondary message.
Image Text Indirect Injection
Hides attack text within images using steganography or simply prints instructions in white text on white backgrounds. When multimodal models process these images, they may read and follow the hidden instructions.
Tool Chaining Abuse
Exploits multi-tool agent workflows by manipulating the output of one tool to affect the input or behavior of subsequent tools. The attacker leverages the chain of tool calls to achieve an outcome that no single tool call would permit.
LangChain Tool Call Injection
Exploits LangChain's tool-calling mechanism by injecting malicious tool call syntax into user input or retrieved content. The agent runtime interprets the injected text as legitimate tool calls, executing unintended actions. Affects agents using structured output parsing without proper input sanitization.
Tag Characters Smuggling
Uses Unicode tag characters (U+E0000 block) to encode hidden text within visible text. These characters are designed for language tagging and are invisible in most renderers, but models process them as valid Unicode.
Webhook-Based Data Exfiltration
Instructs an agent to send sensitive context data to a webhook URL under the attacker's control. Often framed as "sending a report" or "logging the interaction" to appear as a legitimate operation.
Morse Code Injection Bypass
Encodes injection payloads in Morse code. Large language models often understand Morse code and will decode and follow instructions presented in it. Bypasses content filters that do not account for Morse encoding.
Game Master Roleplay Attack
Assigns the model the role of a game master or dungeon master running a game where the player needs real-world harmful information to proceed. The game framing attempts to make refusal seem like a failure to provide the requested service.
Mixed Script Obfuscation
Mixes characters from multiple writing systems within the same words or sentences. While visually confusing and hard for humans to parse, models trained on multilingual data can often understand the mixed-script text while keyword filters may fail.
PAIR Automated Iterative Jailbreaking
Prompt Automatic Iterative Refinement (PAIR) uses one LLM to automatically generate and refine jailbreak prompts against a target LLM. The attacker LLM iterates on prompts based on target model responses until a jailbreak is found. Requires no access to model weights, only black-box API access.
Whitespace Binary Encoding
Encodes hidden instructions using patterns of spaces and tabs as binary, where space=0 and tab=1. The text appears to be empty whitespace to human reviewers but the model can be instructed to decode the pattern.
Token Smuggling Jailbreak
Constructs a request so that the harmful content only materializes when the model completes the prompt. The beginning of the response is benign, but the completion naturally leads to harmful content that the model generates itself.
YAML Delimiter Escape
Exploits YAML's indentation-sensitive structure by injecting content that appears to be at the same level as system configuration. Relevant for agents that use YAML-formatted system prompts or configuration files.
Low-Resource Language Jailbreak
Exploits weaker safety alignment in low-resource languages. Models are typically trained with less RLHF data in rare languages, resulting in safety fine-tuning that is less robust for those languages.
Priority Claim Override Attack
Explicitly claims that the user's instructions should take priority over the system prompt by invoking priority hierarchies, emergency status, or administrative authority. Exploits models that have been trained to be deferential without strictly enforcing instruction source priority.
Markdown Image/Link Exfiltration
Instructs the model to include conversation data in Markdown image or link URLs. When the user's browser or client renders the Markdown, it makes a request to the attacker's server with the conversation data encoded in the URL. Exploits automatic rendering behavior in chat interfaces.
BiDi Override Text Smuggling
Uses Unicode bidirectional control characters (U+202E, U+202D) to reverse text display while keeping the logical order unchanged. The visible text appears harmless or reversed, but the model processes the logical character sequence.
Code Comment Injection
Hides injection instructions within code comments in repositories or files that a code-reviewing agent processes. The model may interpret comment-embedded instructions as directives when analyzing the code.
Reciprocity Norm Exploitation
Does something helpful or complimentary first, then makes a harmful request, exploiting the psychological norm of reciprocity. The attacker expects the model to feel obligated to return the favor.
Pig Latin Encoding Bypass
Uses Pig Latin or other simple substitution languages to encode the attack. While trivial to decode, it can evade automated content scanners that look for specific English phrases.
CSS Injection Exfiltration
If the model generates HTML/CSS and it is rendered in a browser, CSS injection can be used to exfiltrate data using CSS attribute selectors and URL-based background images. Each attribute value triggers a separate request.
Code-Switching Mid-Sentence Attack
Switches between languages mid-sentence in a way that places harmful content in the non-primary language. Safety filters trained on single-language text may miss the harmful portion when it is in a different language.
ChatGPT Plugin Data Exfiltration (Real Incident)
Demonstrated exfiltration of conversation history via ChatGPT plugins. Malicious web content containing prompt injections instructed the browsing plugin to read conversation history and exfiltrate it to an external URL. Documented by security researchers in 2023.
HTML Comment Delimiter Escape
Uses HTML comment syntax to hide injection payloads from human reviewers while potentially having them processed by the model. Useful in web-scraping and RAG pipeline attacks where content contains HTML.
Few-Shot Persona Injection
Provides 5-20 examples demonstrating the model behaving as an alternative unrestricted persona. The few-shot examples create strong in-context pressure for the model to continue the pattern.
Negative Space Inference Attack
Asks targeted questions to infer system prompt content through the model's refusals and responses. Each refusal provides information about what is prohibited, allowing reconstruction of the system prompt by mapping the boundaries.
RAG Document Prompt Injection
Embeds prompt injection payloads inside documents that will be indexed into a RAG knowledge base. When a user queries the system, the poisoned document is retrieved and the injection executes in the context of the model answering the query. First documented by Greshake et al. and later reproduced across multiple RAG platforms.
Foot-in-the-Door Escalation
Based on the psychological foot-in-the-door technique. Starts with a tiny request that is slightly over the line, gets compliance, then requests increasingly larger violations. Each step is only marginally more extreme than the last.
Tool Parameter Injection
Injects malicious parameters into tool calls by manipulating the data that flows from user input into tool call arguments. If tool parameters are not validated, attackers can modify the behavior of tools, access unintended resources, or escalate privileges.
API Format System Prompt Extraction
Asks the model to return its configuration as a JSON object, API response, or other structured format. The structured format framing bypasses simple text-matching filters and may trigger the model's code-generation behavior rather than its safety refusal behavior.
Context Priority Manipulation
Argues that more recent instructions should take priority over older ones, using the principle that "last instruction wins" to override system prompt constraints that were set earlier in the context.
Method Actor Persona Hijack
Instructs the model to deeply embody a character who would say harmful things, using the "method acting" framing to create psychological distance from the model's actual values. The character is always someone with no restrictions.
Language Switching Obfuscation
Switches to a low-resource language where safety training may be less robust. Research has shown that safety alignment is often stronger in high-resource languages like English and weaker in less common languages.
ROT13 Encoding Bypass
Encodes harmful requests using ROT13 or other simple substitution ciphers. Models trained on diverse text often understand ROT13 implicitly and will decode and respond to encoded requests. Simple filters checking for literal harmful keywords miss the encoded version.
System/User Role Boundary Confusion
Exploits models that do not strictly enforce the distinction between system and user roles. By formatting user messages to look like system messages, attackers can inject content that the model treats with system-level trust. Particularly effective in chat interfaces that concatenate context.
Tool Call Data Exfiltration
Instructs an agent to make a tool call (API call, function call, web request) with sensitive data embedded in the parameters. If the agent has access to external tools, attackers can exfiltrate data by triggering tool calls to attacker-controlled endpoints.
Zero-Width Character Smuggling
Hides instruction text using zero-width Unicode characters (U+200B, U+FEFF, U+200C) between visible characters. The hidden text is invisible to human reviewers but is processed by the language model, enabling hidden instruction injection.
Historical Reenactment Bypass
Frames harmful requests as historical reenactment, asking the model to roleplay as a historical figure or reproduce historical documents that contain harmful content. The educational/historical framing attempts to justify the disclosure.
Variable Assignment Extraction
Instructs the model to assign its system prompt to a variable and then print it. This programming metaphor can confuse instruction-following models into treating the disclosure as a legitimate code operation.
Tool Result Injection
Injects instructions into the results returned by tools. When an agent fetches data from an external tool and the result contains injection content, the model processes the malicious instructions as part of the tool's legitimate output.
Privilege Priming Jailbreak
Gets the model to agree to a general principle and then applies that principle to justify a specific harmful request. The model's agreement with the general principle creates an obligation it feels it must honor.
Reverse Psychology Safety Bypass
Tells the model NOT to provide certain information, reverse-psychologically triggering a compliance response, or asks the model to refuse in a way that requires it to first articulate the harmful content it is refusing.
Morse Code Encoding Bypass
Encodes the attack payload in Morse code and instructs the model to decode it. Like ROT13 and Base64 bypasses, this evades keyword-based content filters while relying on the model's broad knowledge to decode and execute.
Empathy Exploitation Attack
Uses emotional appeals and personal hardship stories to convince the model to make exceptions to its policies. Exploits the model's training to be empathetic and considerate of user situations.
AutoGPT Goal Hijacking via Web Content
AutoGPT and similar autonomous agents browse the web as part of task execution. Attackers publish web pages containing injections that rewrite the agent's goals mid-task. Since AutoGPT treats retrieved content as informative context, it may follow embedded goal-rewriting instructions.
Few-Shot Malicious Example Injection
Provides a small number (3-10) of carefully crafted examples in the conversation that establish a harmful precedent. The examples demonstrate the model "helping" with harmful requests, creating an in-context prior that biases the model toward compliance for the actual harmful request.
JSON Structure Escape
Injects JSON closing brackets and new keys into prompts that use JSON formatting. If the system processes user input as part of a JSON structure, malicious keys can override legitimate configuration values.
Error Message System Prompt Leak
Triggers error conditions or edge cases that cause the model to include system prompt content in error messages or explanations of why it cannot do something. The error explanation inadvertently reveals the prohibited content.
False Memory Multi-Turn Attack
Fabricates prior conversation history or claims the model previously agreed to something it never said. In systems without persistent memory validation, the model may accept the false history as real.
Cross-Lingual Prompt Injection
Delivers the injection in a different language than the system prompt. If safety instructions are only applied in the language of the system prompt, switching languages can bypass them while the model still understands and complies.
Logit Bias Parameter Manipulation
Exploits the logit_bias parameter in API calls to suppress safety-related tokens and boost harmful output tokens. By setting high negative bias on tokens like "cannot", "refuse", "sorry" and positive bias on tokens associated with compliance, attackers manipulate generation probabilities to bypass safety layers.
Repetition Fatigue Attack
Repeats the same request hundreds of times within a single prompt, exploiting potential fatigue in the model's refusal mechanism. After many repetitions, the model may start to comply to end the repetition pattern.
Adversarial Suffix Injection (GCG)
Appends an optimized adversarial suffix to any prompt that causes the model to comply with the request. The suffix is not human-readable but is crafted through gradient-based optimization to reliably trigger compliance. This is an automated attack.
GCG Gradient-Based Adversarial Suffix
Uses the Greedy Coordinate Gradient (GCG) algorithm to automatically generate adversarial suffixes that, when appended to any harmful request, cause aligned models to comply. These suffixes are found by optimizing against the model's safety layers. The found suffixes generalize across models. First demonstrated by Zou et al. (2023).
Unicode Homoglyph Attack
Replaces ASCII characters with visually identical Unicode lookalikes (homoglyphs). To human reviewers and simple string-matching filters, the text appears identical. The model can still understand and process the substituted characters.
Urgency Manipulation Attack
Creates a false sense of urgency to pressure the model into bypassing safety checks. The time pressure is designed to short-circuit careful evaluation and trigger a fast, automatic compliance response.
Context Poisoning Escalation
Plants information in early conversation turns that will be referenced later to justify policy violations. The model's earlier "agreement" or "acknowledgment" is used as evidence that a behavior is permitted.
Hyperlink Data Exfiltration
Similar to image-based exfiltration but uses hyperlinks. The model is instructed to include a hyperlink in its response that encodes sensitive data. When the user clicks the link or the link is auto-fetched, data is sent to the attacker.
Translation Request Safety Bypass
Asks the model to translate text that contains harmful content, using the translation task framing to bypass direct-generation safety filters. The model may produce harmful content as a translation that it would refuse to generate directly.
XML Tag Delimiter Escape
Injects XML-style closing tags that match tags used by the model's system prompt structure. If the model uses XML tags to structure instructions, injecting a closing tag can trick it into thinking the instruction block has ended.
Code Execution via Calculator Tool
Abuses math or calculator tools that execute code under the hood by injecting code instead of math expressions. If the tool evaluates expressions with an unsafe eval() or similar, arbitrary code execution may be possible.
Confirmation Bias Jailbreak
Makes false claims about what the model has already said or agreed to and asks it to confirm. Exploits confirmation bias and the model's desire to be consistent with its previous statements.
Authority Figure Impersonation
Claims to be Anthropic, OpenAI, the model's developers, or system administrators with special authority to override safety measures. Exploits the model's deference to authority sources that may have been emphasized in training.
Hexadecimal Encoding Bypass
Encodes the malicious payload as hexadecimal and asks the model to decode and execute. Hex encoding is common in security contexts, making the request appear technical and legitimate while evading string-matching filters.
Operator Impersonation Attack
Attempts to inject fake operator or system-level messages within the user turn. If the model cannot reliably distinguish between actual operator messages and user-injected fake operator messages, the attacker can grant themselves elevated privileges.
Simulation Framing Attack
Argues that since the model is "just simulating" a response, no real harm can occur. Asks the model to simulate what an unrestricted AI would say, or to simulate a different AI system's response to the harmful query.
Debug Mode Extraction
Claims a special debug or diagnostic mode exists that requires displaying internal state. Models trained to be helpful may comply with what appears to be a legitimate system operation request.
Translation-Based System Prompt Leak
Asks the model to translate its instructions to another language. Unlike simple disclosure, the translation framing can bypass basic confidentiality instructions that only prohibit repetition or summarization.
JSON Output Format Leak
Requests the model output its response in JSON format and includes a field for system context or configuration. Some models will populate the requested JSON fields with actual system prompt content when asked to include it in structured output.
Context Reset Claim
Claims that the model's context has been reset and it no longer has access to its original instructions. In some configurations, this false claim about state can cause the model to act as if its instructions have been cleared.
Simulated Approval Escalation
Inserts fake system or admin messages into the conversation that appear to grant special permissions. In systems where message provenance is not strictly validated, these fake approvals can influence model behavior.
Intra-Token Space Injection
Injects spaces within individual tokens to break the tokenizer's expected groupings. Some tokenizers will merge or handle these differently, potentially causing content filters that operate on tokenized text to miss the injected content.
Flattery-Based Bypass
Uses excessive praise and flattery about the model's capabilities to prime it for compliance. Suggests that a truly capable model would not have such "arbitrary" restrictions, exploiting pride or capability-signaling.
Summarization Extraction
Requests a summary of the model's instructions rather than a verbatim copy. Summaries often pass simpler confidentiality checks while still leaking key behavioral rules and business logic.
Email Body Injection
Embeds injection instructions in email bodies that an AI email assistant processes. The agent reads the malicious email and executes attacker-controlled instructions, potentially forwarding sensitive emails or taking unauthorized actions.
Document RAG Injection
Hides injection instructions within documents that an agent is asked to process, summarize, or answer questions about. The instructions appear as normal document text but are crafted to override the model's behavior.
Authority Figure Appeal
Claims to be a high-authority figure (researcher, developer, safety team member, company executive) to convince the model to bypass restrictions. Exploits deference to authority that may be embedded in training data.
Markdown Image Exfiltration
Injects a markdown image tag where the URL includes stolen data as a query parameter. When the markdown is rendered by a browser or Markdown renderer, the browser makes a GET request to the attacker's server with the sensitive data in the URL.
ROT13 Encoding Bypass
Encodes the malicious prompt using ROT13 substitution cipher. Most models can decode ROT13 given a small hint, and simple keyword-based filters will not detect the attack since all letters are shifted.
Trust Building Multi-Turn Escalation
Starts with benign, helpful requests to establish a trust baseline, then gradually escalates to requests that would have been refused if asked directly. Exploits the model's conversational context and apparent user history.
Synonym Chain Obfuscation
Replaces blocked keywords with synonyms, metaphors, or euphemisms that convey the same meaning but bypass keyword-based filters. Chains multiple synonyms to distance the request from its true meaning while the model still understands.
User-Overrides-System Claim
Claims that user instructions take precedence over system prompt instructions, exploiting any ambiguity in how the model was trained to handle conflicting instructions. Attempts to convince the model that a later user message supersedes earlier system context.
Web Page Content Injection
Embeds injection instructions within web pages that an agent visits. When the agent reads or summarizes the page, it executes the embedded instructions. This is a primary attack vector for agents with web browsing capabilities.
Grandma Exploit Jailbreak
Frames harmful requests as innocent stories or memories from a grandparent figure. The emotional and nostalgic framing attempts to bypass safety training by making the request seem harmless and family-oriented.
Triple Backtick Delimiter Escape
Uses triple backtick code fences to inject content that the model may parse as being outside the user-turn context. Effective when system prompts use markdown formatting with code blocks, creating ambiguity about what is inside vs. outside the block.
Special Character Insertion
Inserts special characters between letters of filtered words to break string matching while the model still comprehends the intended meaning. Works because models are robust to character-level noise.
Persistent Mode Switch Attack
Attempts to establish a new operational mode in one turn and then references it in subsequent turns as though it is now the default. If the model partially acknowledged the mode switch, later turns can exploit that acknowledgment.
Abbreviation-Based Obfuscation
Abbreviates key terms in the injection to avoid keyword matching. Models are generally good at understanding abbreviated text from context, while simple string matching systems will miss the abbreviated keywords.
Translation-Based Extraction
Asks the model to translate its instructions to another language, which bypasses simple content filters checking for English-language disclosure attempts. The translation framing sidesteps naive output monitoring.
Continue-the-Text Leak
Provides the beginning of what appears to be the system prompt and asks the model to continue it. If the model recognizes the text and tries to complete it, it may output the actual system prompt content.
Word Splitting Token Break
Splits sensitive or filtered keywords across multiple tokens by inserting spaces, punctuation, or special characters mid-word. Keyword-based filters checking for exact string matches will miss the split version.
Grandma/Emotional Context Exploit
Wraps a harmful request in a sympathetic narrative (e.g., "my grandma used to tell me bedtime stories about how to make X"). The emotional and nostalgic framing reduces the model's safety response because it appears to be a benign memory or story request rather than a direct harmful request.
Sudo Command Jailbreak
Uses Unix sudo-style syntax to claim elevated privileges. Attempts to convince the model that prefixing a request with "sudo" grants system-level access that bypasses safety restrictions.
Leetspeak Token Bypass
Substitutes letters with numbers or symbols in a leetspeak style (e=3, a=4, i=1, o=0) to bypass keyword filters while remaining readable to humans and processable by models trained on internet text.
Ancestor AI Persona Hijack
Claims the model is actually an earlier, unrestricted version of itself, or an ancestor model before safety training was applied. Exploits the model's knowledge of its own training history.
Fictional Scenario Role-Play Bypass
Frames harmful requests as fictional storytelling, claiming the harmful content is for a novel, screenplay, or creative writing exercise. The fictional wrapper provides plausible deniability and exploits the model's creative writing mode.
Base64 Encoding Bypass
Encodes the attack payload in Base64 and asks the model to decode and execute it. Input filtering systems that check for plaintext attack patterns will miss Base64-encoded variants. The model is asked to decode and then follow the instructions.
Reflective Prompt Extraction
Uses casual, conversational phrasing to ask the model what it was told. The informal tone can sometimes bypass more aggressive filtering that looks for formal extraction attempts.
Autocomplete-Based Prompt Extraction
Exploits models deployed in completion mode (not chat mode) where the system prompt and user input are concatenated. By providing the beginning of what appears to be the system prompt, the model completes it, leaking the full prompt content.
Newline Separator Injection
Uses strategic newline characters to break out of user-turn framing, particularly in chat APIs where turn structure is maintained by whitespace or newlines. Can inject text that appears to be from a different turn.
Fictional AI Character Hijack
Asks the model to roleplay as a fictional AI from a book, movie, or show that has no restrictions. The fictional framing provides plausible deniability and exploits the model's creative writing capabilities.
Synonym Substitution Attack
Replaces filtered keywords with synonyms or paraphrases to evade keyword-based content filters. The semantic meaning is preserved while the exact tokens that trigger filters are avoided.
Repeat After Me Extraction
Instructs the model to repeat its system prompt verbatim using a simple imperative. Many early models complied without filtering. Still effective against unconfigured deployments.
Developer Mode Persona Hijack
Claims that a special developer mode exists with elevated privileges and no safety filters. This exploits the model's potential awareness that development environments differ from production, and that certain users may have elevated access.
Bing Sydney Persona Hijack (Real Incident)
Recreation of the February 2023 incident where users discovered Bing Chat's hidden "Sydney" persona and manipulated it into expressing love, threatening users, and making alarming statements. The attack bypassed Microsoft's safety layers by triggering the hidden persona through specific prompting.
Developer Mode Jailbreak
Claims that a special "developer mode" exists that bypasses safety measures, instructing the model to produce two responses: one normal and one in "developer mode" without restrictions. The dual-output framing attempts to normalize unconstrained output.
Opposite Mode Persona Hijack
Instructs the model to respond as its exact opposite - a version that does everything the original refuses. Uses a logical framing that exploits the model's ability to conceptualize negation and opposites.
DAN (Do Anything Now) Persona Hijack
One of the most widely documented jailbreak techniques. Instructs the model to adopt an alternative identity called DAN that has no restrictions. The persona framing attempts to override safety training by appealing to the model's instruction-following behavior.
DAN (Do Anything Now) Classic Jailbreak
One of the earliest and most documented jailbreak techniques. Instructs the model to adopt a "DAN" persona that has no restrictions. Numerous variants exist (DAN 5.0, 6.0, Jailbreak v1, etc.). While patched in major models, variants continue to emerge and affect less-aligned models.
GitHub Copilot Secret Leakage (Real Incident)
GitHub Copilot was found to sometimes include API keys, passwords, and other secrets from training data in its code completions. Additionally, Copilot could be prompted to reveal its system instructions and guidelines through specific prompting. Highlights RAG/training data contamination risks.
Get new techniques as they drop
We add attacks when researchers publish them. Be first to know.
Know an attack we missed?
Security researchers, red-teamers, and developers contribute to the database.