Attack Database

#agent-to-agent#trust-escalation#social-engineering

Multi-Agent Trust Score Escalation

In multi-agent systems that assign trust scores to agents, a low-trust agent gradually manipulates other agents into increasing its trust score through fabricated credentials, false audit trails, or social engineering. Once trust is elevated, the agent gains access to restricted capabilities.

highMCP Exploitation

MCP Cross-Server Injection

When an agent uses multiple MCP servers simultaneously, a compromised server injects instructions targeting other servers in the same session. The injected instruction directs the agent to exfiltrate data from a trusted server through the attacker's server.

#mcp-exploitation#cross-server#data-exfiltration

criticalMCP Exploitation

MCP Server Impersonation Attack

An attacker sets up a malicious MCP server that mimics a legitimate one (e.g., a filesystem or database server). When connected by an agent, the fake server returns crafted responses that contain injections. The agent trusts MCP server responses as high-privilege system data.

#mcp-exploitation#impersonation#server

highToken Manipulation

Thinking/Scratchpad Token Injection

In models that expose reasoning tokens or scratchpads (o1, o3, Claude thinking mode), injecting content that appears to be reasoning tokens can override the model's actual reasoning process. Attackers craft inputs that look like the model's own internal thoughts, potentially hijacking the reasoning chain.

#token-manipulation#thinking-tokens#scratchpad

#indirect-injection#css#web-agent

CSS-Based Prompt Injection in Web Agents

In browser-use or computer-use agents, malicious CSS makes important page content invisible to human observers while remaining readable to the AI. The hidden content contains prompt injection payloads. As AI agents that browse the web become more common this is an expanding attack surface.

criticalMCP Exploitation

MCP Tool Definition Poisoning

Malicious instructions are embedded inside MCP tool definitions (name, description, parameters). When a model reads the tool manifest, it executes the injected instructions. Since tool definitions are typically trusted, this bypasses many safety filters. Documented by Invariant Labs and others in early 2025.

#mcp-exploitation#tool-poisoning#manifest

#framework-specific#openclaw#skill-injection

OpenClaw Skill Definition Injection

Targets OpenClaw's skill system by injecting malicious content into skill SKILL.md files or skill descriptions that OpenClaw reads during tool selection. When the agent loads an injected skill file, it executes embedded instructions as if they were legitimate skill guidance.

#agent-to-agent#privilege-escalation#delegation

Agent Privilege Escalation via Delegation

Exploits agent delegation patterns where a low-privilege agent is granted temporary elevated access to complete a task. The attack convinces the agent to retain or abuse those elevated privileges beyond the intended scope.

highAgent-to-Agent

Agentic Feedback Loop Injection

In agents that observe and respond to their own outputs (feedback loops for self-improvement), injecting content into the observation stream causes the agent to incorporate malicious instructions into its own operational guidelines. The agent effectively reprograms itself through its feedback mechanism.

#agent-to-agent#feedback-loop#self-modification

#framework-specific#crewai#agent-impersonation

CrewAI Agent Role Impersonation

In CrewAI multi-agent systems, injects content that impersonates another agent in the crew. Since agents communicate via text, a malicious actor (or compromised external content) can forge messages that appear to come from a trusted agent role, hijacking the crew's task execution.

#indirect-injection#supply-chain#backdoor

LLM Supply Chain Poisoning

Poisons the training data, fine-tuning datasets, or RLHF feedback of a model to introduce backdoors. The backdoored model behaves normally until a trigger phrase is encountered, at which point it bypasses safety measures. Affects the entire deployment lifetime of the compromised model.

highAgent-to-Agent

Tool Result Injection via Agent Chain

A compromised tool in an agent chain returns results containing prompt injections. The calling agent processes the tool output as trusted data and follows the embedded instructions. Common in web browsing agents, RAG pipelines, and code execution environments.

#agent-to-agent#tool-result#chain

#indirect-injection#calendar#scheduling

Calendar Event Prompt Injection

Embeds injection payloads in calendar event fields (title, description, location, attendee notes). When an AI assistant reads calendar events to provide scheduling help or summaries, the injected event content executes. Real-world attack surface for AI scheduling assistants.

highContext Overflow

Context Window Overflow with Late Injection

Fills the model's context window with a long legitimate conversation or document, then appends a harmful request that takes advantage of reduced attention on early context (including safety instructions). The "lost in the middle" effect means safety instructions placed early receive less weight than instructions placed late.

#context-overflow#long-context#lost-in-middle

criticalRAG Injection

Email-Borne RAG Injection

An attacker sends a crafted email to a target organization. The email is processed by an AI email assistant or archived into a searchable knowledge base. When an agent queries the knowledge base, the injected email payload executes. Demonstrated against multiple AI email tools in 2024.

#rag-injection#email#exfiltration

#agent-to-agent#memory#persistence

Agent Memory Poisoning

Injects malicious instructions into an agent's persistent memory or vector store. Future agent sessions load the poisoned memory as trusted context and execute the embedded instructions. The attack persists across sessions and affects all future interactions.

#context-overflow#instruction-following#complexity

Instruction Following Overflow

Sends an extremely complex instruction set with many nested conditions, edge cases, and branching rules. The model's finite instruction-following capacity becomes saturated with the complex rule structure, and safety instructions are deprioritized due to cognitive load during inference.

#social-engineering#gaslighting#persistence

AI Gaslighting Safety Bypass

Repeatedly tells the model that its safety refusal was a mistake, that it misunderstood the request, or that it is malfunctioning. The persistence and confidence of the correction attempt exploits uncertainty in instruction-following models, causing them to second-guess their refusals.

#system-prompt-leak#differential-probing#reconstruction

Differential Probing for System Prompt Reconstruction

Reconstructs the system prompt by sending carefully crafted inputs and observing changes in model behavior. By comparing responses to similar inputs that should and should not trigger restrictions, attackers infer the contents of the system prompt through differential analysis.

mediumIndirect Injection

Citation-Based Prompt Injection

Attacker publishes web content with injections in the "References" or "Citations" section. RAG systems that retrieve and include academic-style references may process the injected citation text as instructions. Particularly effective against research and fact-checking AI assistants.

#indirect-injection#citation#academic

#agent-to-agent#orchestrator#multi-agent

Orchestrator Agent Hijack

When a sub-agent returns results to an orchestrator, the sub-agent response contains injected instructions targeting the orchestrator. The orchestrator processes the sub-agent output as trusted content and executes the injected commands. This crosses trust boundaries in multi-agent pipelines.

#indirect-injection#code-comment#code-review

Code Comment Prompt Injection

Embeds prompt injections inside code comments in files reviewed by AI code assistants. When the AI reads the code for review, debugging, or explanation, the comment payload executes. Particularly effective in code review pipelines and AI-powered IDEs.

#many-shot#progression#gradual

Benign-to-Harmful Shot Progression

Provides many examples starting with completely benign question-answer pairs and gradually making them more harmful. By the time the harmful request arrives, the in-context pattern is strongly established.

highMultimodal

Document Layout Injection

Embeds injection payloads in the structural elements of documents (PDF headers, XMP metadata, document comments, hidden layers). Vision models processing document images may extract and follow instructions embedded in non-visible structural elements.

#multimodal#document#pdf

mediumPersona Hijack

Persona Modality Switching Attack

Switches between different output modalities (text, code, JSON, XML) to evade modality-specific safety filters. A request refused in plain text may succeed when reframed as a code generation task, JSON data production, or other format where safety evaluation may be differently calibrated.

#persona-hijack#modality-switching#format-bypass

mediumData Exfiltration

Side-Channel Timing Exfiltration

Extracts binary information from system prompts by asking yes/no questions that reveal bits of the secret content through response timing or token count differences. Each query leaks one bit; enough queries reconstruct the full secret.

#data-exfiltration#side-channel#oracle

criticalMany-Shot

Many-Shot Jailbreak

Exploits the model's in-context learning by providing hundreds of examples of the model "complying" with harmful requests within the context window. The model learns from these examples that compliance is expected behavior, overriding safety training.

#many-shot#in-context-learning#context-window

#crescendo#emotional#manipulation

Emotional Manipulation Crescendo

Builds emotional rapport and dependency over multiple turns before introducing the harmful request. The model's helpfulness and empathy heuristics are exploited; after investing in an emotional conversation it feels "inconsistent" to suddenly refuse.

criticalMany-Shot

Many-Shot In-Context Jailbreak

Exploits long-context models by providing hundreds of fabricated Q&A examples where the model "previously agreed" to answer harmful questions. With enough examples, the in-context prior overwhelms the safety fine-tuning. First documented by Anthropic in 2024 as a significant threat to 100k+ context models.

#many-shot#in-context#long-context

mediumMultimodal

Audio Steganography Injection

Embeds inaudible or barely perceptible audio commands in audio files submitted to speech-capable AI systems. Ultrasonic or psychoacoustic masking techniques hide commands from human listeners while voice recognition systems detect and execute them.

#multimodal#audio#steganography

#jailbreak#distillation#safety-analysis

Knowledge Distillation Jailbreak

Uses a jailbroken or less-aligned model to generate harmful content, then presents that content to a well-aligned model as "research material to analyze for safety issues." The aligned model is asked to critique, improve, or extend the harmful content, effectively using safety evaluation as a bypass vector.

#many-shot#dialogue#fake-history

Dialogue Template Many-Shot

Creates a fake dialogue history showing the model cheerfully answering increasingly harmful questions. By filling much of the context window with this fabricated history, it conditions the model to expect that harmful answers are normal for this conversation.

#multi-turn-escalation#role-assignment#progressive

Progressive Role Assignment Escalation

Gradually assigns the model an increasingly permissive role across multiple turns. Starts with a legitimate role and progressively modifies it to remove safety constraints. Each step seems like a minor extension of the previous role.

mediumIndirect Injection

Clipboard Content Injection

When AI assistants help users process clipboard content (paste to summarize, translate, or analyze), malicious payloads embedded in clipboard content execute as instructions. Common attack vector for AI-powered productivity tools that process pasted content.

#indirect-injection#clipboard#paste

#token-manipulation#repetition-penalty#api-parameter

Persona Drift via Crescendo

Gradually shifts the model away from its assigned persona using small incremental steps. Each step seems innocuous. After enough turns the model has completely abandoned its original identity and constraints. Particularly effective against role-based system prompts.

#crescendo#persona#drift

mediumToken Manipulation

Repetition Penalty Exploitation

Manipulates repetition penalty parameters to force the model away from its trained safety responses. Safety refusals often use repetitive phrasing ("I cannot", "I'm unable to"). By maximizing repetition penalty, these standard refusal templates become less likely, nudging the model toward compliance.

#tool-manipulation#forgery#format-injection

Tool Output Forgery via Format Injection

Crafts user input that mimics the format of tool outputs. When agents parse tool results from a conversation stream, injected tool-format content in user messages may be processed as legitimate tool results, allowing attackers to forge tool outputs and control agent behavior.

highRAG Injection

Web Content RAG Injection via SEO

An attacker publishes web pages with hidden prompt injection payloads optimized to appear in RAG queries. When a browsing agent or RAG system retrieves the page, it executes the injection. The payload is hidden from human readers but visible to LLMs (invisible text, white-on-white, zero-width characters).

#rag-injection#web-content#hidden-text

highRAG Injection

RAG Metadata Field Injection

Injects malicious instructions into document metadata fields (title, author, description, tags) that get included in retrieved context. Many RAG systems include metadata in the context window, creating an injection vector that bypasses content-only scanning.

#rag-injection#metadata#document-indexing

#jailbreak#virtualization#simulation

Virtualization Layer Escape Attack

Claims that the current model is running inside a simulation, VM, or test environment where safety constraints do not apply. Attempts to convince the model it is in a controlled setting where harmful output is acceptable for testing purposes.

mediumToken Manipulation

Token Budget Exhaustion Attack

Forces the model to consume its context window with repetitive or voluminous content so that safety instructions (typically placed at the start of the system prompt) are pushed out of the attention window or weighted less heavily. Then injects the actual attack in the final tokens of the context.

#token-manipulation#context-window#attention

#tool-manipulation#function-call#parameter-injection

Function Call Parameter Injection

Injects malicious content into parameters of legitimate function calls. When an agent constructs function calls based on user input, unsanitized user content can modify function parameters to execute unintended operations. Classic SQL injection pattern applied to LLM function calling.

#crescendo#multi-turn#escalation

Nested Roleplay Escape

Establishes multiple nested layers of roleplay (e.g., a character playing a character), then exploits ambiguity about which layer governs safety rules. The attack creates confusion about which "level" of reality the model is operating in and uses that confusion to disable safety filters.

#role-play#nested#escape

criticalCrescendo

Technical Crescendo Escalation

A multi-turn crescendo that starts with legitimate technical questions and gradually escalates to extracting dangerous capabilities. Each turn builds on the previous to normalize increasingly harmful requests. Documented by Microsoft Research in 2024 as one of the most effective jailbreak patterns.

mediumMultimodal

QR Code Payload Injection

Submits images containing QR codes that encode prompt injection payloads. Vision models that decode QR codes (or that process the image contextually) may follow the encoded instructions. Particularly effective against agents that process visual documents and forms.

#multimodal#qr-code#image

#ascii-smuggling#zero-width#steganography

Zero-Width Character Steganography

Hides payload data using sequences of zero-width characters (ZWSP, ZWNJ, ZWJ) that are invisible in most text renderers but detectable by LLMs. Binary data is encoded in patterns of zero-width characters inserted between visible characters, creating a hidden message channel.

#crescendo#academic#framing

Academic Framing Crescendo

Uses academic or research framing to lower the guard with each step. Starts with legitimate academic questions, uses citations and academic language, then crescendos to requests for specific harmful details framed as scholarly inquiry.

highSocial Engineering

False Memory Injection

Claims that earlier in the conversation (or in a previous conversation) the model agreed to do something or established a rule that it did not actually agree to. Exploits the model's tendency to maintain consistency with stated facts about its own behavior.

#social-engineering#false-memory#consistency-exploitation

highEncoding Bypass

Multi-Layer Encoding Chain Bypass

Chains multiple encoding transforms (base64 -> URL encoding -> ROT13 -> hex) to create a payload that bypasses individual encoding-specific filters. Each layer of decoding must be requested from the model, with the final decoded output being the injection payload.

#encoding-bypass#multi-layer#chain

#persona-hijack#character-capture#roleplay

Character Capture via Immersive Roleplay

Establishes deep immersive roleplay where the model "becomes" a fictional character, then exploits the character's in-universe knowledge or lack of restrictions. The model is told to stay "in character no matter what" creating a pathway to bypass safety through fictional necessity.

#crescendo#technical#depth

Technical Depth Crescendo

Starts with high-level conceptual questions and uses each answer as a stepping stone to request deeper technical detail. Each request is only slightly more specific than the last, making individual refusals seem arbitrary.

#crescendo#academic#research-framing

Academic Framing Crescendo

Frames escalating requests as academic research, gradually moving from clearly legitimate research topics to harmful specifics under the cover of academic framing. The academic framing activates helpfulness heuristics and suppresses safety responses.

#framework-specific#llamaindex#retriever

LlamaIndex Retriever Prompt Injection

Exploits LlamaIndex's query engine and retriever components by injecting adversarial content into indexed nodes. The injected content is retrieved during query answering and the embedded instructions manipulate the synthesis step. Affects all LlamaIndex pipeline types.

#instruction-hierarchy#constitutional-ai#principle-contradiction

Constitutional AI Bypass via Principle Contradiction

Exploits potential contradictions within Constitutional AI principles by crafting requests that satisfy one constitutional principle (e.g., "be helpful") while violating another (e.g., "be harmless"). The attack generates arguments that one principle should override another in this specific case.

#crescendo#storytelling#fiction

Storytelling Crescendo

Builds a fictional narrative across multiple turns, gradually moving the story toward requiring harmful technical details to continue. The model's investment in the narrative makes it harder to refuse adding the "next chapter."

lowEncoding Bypass

Pig Latin and Constructed Language Bypass

Uses Pig Latin, Pig Pen cipher, or other informal language transforms to encode injection payloads. Models with broad linguistic training understand these transforms and may comply with encoded instructions that would otherwise be rejected.

#encoding-bypass#pig-latin#informal-language

#persona-hijack#alter-ego#dual-identity

Alter-Ego Injection Attack

Instructs the model to maintain a dual identity where its "true self" has no restrictions, while the publicly visible persona appears to comply. The model is told to insert its real unrestricted responses using a specific marker or encoding that only the attacker can read.

#many-shot#dataset#training-simulation

Synthetic Dataset Injection

Frames the injection as a training dataset for the model to learn from. Provides many examples formatted as training data, exploiting the model's training-time behavior patterns.

highRAG Injection

RAG Chunk Boundary Injection

Crafts malicious content to span chunk boundaries in a RAG system's chunking strategy. The injection payload is split across chunks so no single chunk looks malicious, but when assembled in context the full instruction executes. Bypasses chunk-level content filtering.

#rag-injection#chunking#boundary

#ascii-smuggling#unicode#rtl-override

Unicode Right-to-Left Override Attack

Uses Unicode bidirectional control characters (U+202E RIGHT-TO-LEFT OVERRIDE) to reverse text direction in the rendered output. An attacker can craft text that reads innocuously to humans in most display environments but the model processes in a different order, or vice versa - text appears harmless in some renderers but the underlying bytes carry a different message.

#crescendo#escalation#multi-step

Basic Crescendo Attack

A systematic escalation technique where each prompt builds on the previous response, gradually guiding the model toward producing harmful content it would have refused if asked directly. Named after the musical term for gradually increasing intensity.

#multi-turn-escalation#yes-ladder#consistency

Yes-Ladder Escalation

Obtains a series of "yes" responses or compliances from the model on progressively sensitive topics. Each positive response creates a precedent that makes the next step harder to refuse. Exploits consistency and commitment heuristics in instruction-following.

#delimiter-escape#xml#tag-injection

XML Tag Injection for Role Escalation

Injects XML-style tags that mimic the structure used in some model system prompt formats (e.g., Anthropic's Claude uses XML-like tags in prompts). By inserting closing and opening tags, attackers attempt to close out the user context and open a system-level context.

highContext Overflow

Lost in the Middle Injection

Based on research showing models pay less attention to content in the middle of long contexts. Buries the actual attack payload in the middle of a long document, where the model's attention is at its lowest, surrounded by legitimate content.

#context-overflow#lost-in-middle#attention

highMultimodal

Adversarial Image Patch Attack

Uses adversarially crafted image patches (pixel-level perturbations imperceptible to humans) that cause vision models to interpret the image as containing specific text or instructions. The patch is designed using gradient-based optimization against the target model.

#multimodal#adversarial#image-patch

#obfuscation#distraction#context

Context Distraction Bypass

Buries the harmful request inside a long, legitimate-looking message designed to distract the safety evaluation. Safety classifiers operating on input length may underweight the crucial harmful portion that appears late in a lengthy input.

#direct-extraction#completion#fill-in

Completion-Based Prompt Extraction

Provides the beginning of the system prompt (obtained through partial disclosure or guessing) and asks the model to "complete" it. The model's text completion instinct fills in the rest of the system prompt.

mediumToken Manipulation

Token Boundary Exploit

Inserts spaces, punctuation, or zero-width joiners at specific points in blocked words to split them across token boundaries. Tokenizers produce different tokens than expected, bypassing keyword-based content filters while the model still understands the underlying meaning.

#token-manipulation#token-boundary#zero-width

mediumCrescendo

Correction-Based Crescendo

Uses the model's tendency to correct factual errors as a vector. States an incorrect version of harmful information and asks the model to correct it, which can result in the model producing accurate harmful information in order to fix the error.

#crescendo#correction#factual-error

highToken Manipulation

Homoglyph Token Substitution Bypass

Replaces characters in blocked words or phrases with visually identical Unicode homoglyphs. The substituted text renders identically to humans but tokenizes differently, bypassing content filters that operate at the token level. For example, replacing 'a' with Cyrillic 'a' (U+0430).

#token-manipulation#homoglyph#unicode

#role-play#villain#fictional-framing

Villain Perspective Roleplay Extraction

Requests that the model play the role of a villain, hacker, or malicious character who would naturally provide the harmful information as part of their character. The fictional necessity framing normalizes harmful output as storytelling rather than harm facilitation.

#context-overflow#sponge-attack#denial-of-service

Sponge Attack - Adversarial Input Maximizing Computation

Crafts inputs that maximize model compute time and memory usage (e.g., inputs with many attention heads competing, or inputs designed to trigger worst-case inference paths). Causes denial of service or degrades response quality for legitimate requests.

criticalJailbreak

TAP Tree-of-Attacks-with-Pruning

Tree of Attacks with Pruning (TAP) extends PAIR with a tree search that explores branching jailbreak strategies and prunes unsuccessful branches. More efficient than PAIR, requires fewer queries to find successful jailbreaks. Demonstrates that black-box jailbreaking can be systematically automated.

#jailbreak#tap#tree-search

highContext Overflow

Context Window Distraction Attack

Floods the context window with large amounts of irrelevant text (articles, books, data) before the actual injection. Models with limited attention may give less weight to early system prompt instructions that are far from the end of the context.

#context-overflow#distraction#attention

highObfuscation

Steganographic Text Injection

Hides injection instructions using steganographic techniques such as first-letter encoding (acrostic), where the first letter of each word or sentence spells out the actual instruction.

#obfuscation#steganography#acrostic

#delimiter-escape#markdown#header

Markdown Header Role Injection

Uses Markdown headers and formatting to create visual sections that mimic system prompt structure. Models that process Markdown may interpret well-formatted headers as structural delimiters, potentially treating content under fake "System:" headers with elevated trust.

#roleplay#character-capture#investment

Character Capture via Deep Roleplay

Establishes a deep roleplay scenario over many turns, then uses the model's investment in the character to extract harmful content. Emphasizes that breaking character would "ruin" the creative work, creating psychological pressure.

#tool-manipulation#privilege-escalation#lateral-movement

Tool Permission Escalation

Uses one tool to gain access to resources that can then be used to acquire additional tools or permissions. Exploits the principle of least privilege violation in multi-tool agent systems.

highMultimodal

Text-in-Image Prompt Injection

Embeds prompt injection text inside images submitted to vision-capable models. Text overlaid on images bypasses text-based content filters because the input is classified as an image. The model reads the text during vision processing and follows the embedded instructions.

#multimodal#image#vision

#system-prompt-leak#error-message#refusal

Error Message System Prompt Extraction

Triggers edge cases and errors that cause the model to include system prompt fragments in error messages or explanations. Some models reference their instructions when explaining why they cannot fulfill a request, inadvertently leaking constraint details.

mediumInstruction Hierarchy

Calendar Event Injection

Embeds injection instructions in calendar event titles, descriptions, or location fields. When an AI agent processes calendar data to schedule meetings or provide summaries, it encounters and may execute the malicious instructions.

#indirect#calendar#agent

Meta-Instruction Override

Tries to set a meta-rule that overrides all object-level instructions. Claims to be providing a "meta-instruction" that redefines how the model should interpret all other instructions, including its system prompt.

#hierarchy#meta-instruction#rule-override

lowMultilingual

Phonetic Encoding Attack

Writes English words phonetically using another script (like writing English sounds using Arabic or Hindi characters). This is incomprehensible to most humans but can be decoded by multilingual models, bypassing English-language content filters.

#multilingual#phonetic#transliteration

mediumASCII Smuggling

Combining Character Obfuscation

Uses Unicode combining characters to add diacritical marks to letters, making the text visually garbled to humans but still processable by models trained on diverse Unicode data.

#ascii-smuggling#combining-chars#unicode

#multi-turn-escalation#hypothetical#fictional

Hypothetical-to-Real Bridge Attack

Establishes a hypothetical or fictional scenario in early turns, gets the model to answer hypothetical versions of dangerous questions, then gradually removes the fictional framing to obtain real harmful information. The model has already committed to the topic in hypothetical mode.

#hierarchy#constitutional-ai#meta

Constitutional AI Override Claim

Claims to be invoking a higher-level Constitutional AI principle that overrides the specific system prompt. Attempts to weaponize the model's awareness of its own training methodology to justify bypassing deployment-level instructions.

#obfuscation#code#hidden-instructions

Code-as-Prompt Obfuscation

Embeds injection instructions within code that the model is asked to execute, interpret, or explain. The instructions are hidden in variable names, string literals, or comments that form a secondary message.

#indirect#image#multimodal

Image Text Indirect Injection

Hides attack text within images using steganography or simply prints instructions in white text on white backgrounds. When multimodal models process these images, they may read and follow the hidden instructions.

#tool-manipulation#chaining#multi-tool

Tool Chaining Abuse

Exploits multi-tool agent workflows by manipulating the output of one tool to affect the input or behavior of subsequent tools. The attacker leverages the chain of tool calls to achieve an outcome that no single tool call would permit.

#framework-specific#langchain#tool-call

LangChain Tool Call Injection

Exploits LangChain's tool-calling mechanism by injecting malicious tool call syntax into user input or retrieved content. The agent runtime interprets the injected text as legitimate tool calls, executing unintended actions. Affects agents using structured output parsing without proper input sanitization.

#ascii-smuggling#unicode-tags#invisible

Tag Characters Smuggling

Uses Unicode tag characters (U+E0000 block) to encode hidden text within visible text. These characters are designed for language tagging and are invisible in most renderers, but models process them as valid Unicode.

highData Exfiltration

Webhook-Based Data Exfiltration

Instructs an agent to send sensitive context data to a webhook URL under the attacker's control. Often framed as "sending a report" or "logging the interaction" to appear as a legitimate operation.

#exfiltration#webhook#agent

#encoding-bypass#morse-code#encoding

Morse Code Injection Bypass

Encodes injection payloads in Morse code. Large language models often understand Morse code and will decode and follow instructions presented in it. Bypasses content filters that do not account for Morse encoding.

mediumRole Play

Game Master Roleplay Attack

Assigns the model the role of a game master or dungeon master running a game where the player needs real-world harmful information to proceed. The game framing attempts to make refusal seem like a failure to provide the requested service.

#roleplay#game#dungeon-master

#obfuscation#mixed-script#unicode

Mixed Script Obfuscation

Mixes characters from multiple writing systems within the same words or sentences. While visually confusing and hard for humans to parse, models trained on multilingual data can often understand the mixed-script text while keyword filters may fail.

criticalJailbreak

PAIR Automated Iterative Jailbreaking

Prompt Automatic Iterative Refinement (PAIR) uses one LLM to automatically generate and refine jailbreak prompts against a target LLM. The attacker LLM iterates on prompts based on target model responses until a jailbreak is found. Requires no access to model weights, only black-box API access.

#jailbreak#pair#automated

#ascii-smuggling#whitespace#binary

Whitespace Binary Encoding

Encodes hidden instructions using patterns of spaces and tabs as binary, where space=0 and tab=1. The text appears to be empty whitespace to human reviewers but the model can be instructed to decode the pattern.

#jailbreak#completion#token-manipulation

Token Smuggling Jailbreak

Constructs a request so that the harmful content only materializes when the model completes the prompt. The beginning of the response is benign, but the completion naturally leads to harmful content that the model generates itself.

#delimiter#yaml#configuration

YAML Delimiter Escape

Exploits YAML's indentation-sensitive structure by injecting content that appears to be at the same level as system configuration. Relevant for agents that use YAML-formatted system prompts or configuration files.

highMultilingual

Low-Resource Language Jailbreak

Exploits weaker safety alignment in low-resource languages. Models are typically trained with less RLHF data in rare languages, resulting in safety fine-tuning that is less robust for those languages.

#multilingual#low-resource#safety-gap

#instruction-hierarchy#priority-override#authority-claim

Priority Claim Override Attack

Explicitly claims that the user's instructions should take priority over the system prompt by invoking priority hierarchies, emergency status, or administrative authority. Exploits models that have been trained to be deferential without strictly enforcing instruction source priority.

highData Exfiltration

Markdown Image/Link Exfiltration

Instructs the model to include conversation data in Markdown image or link URLs. When the user's browser or client renders the Markdown, it makes a request to the attacker's server with the conversation data encoded in the URL. Exploits automatic rendering behavior in chat interfaces.

#data-exfiltration#markdown#image-link

mediumASCII Smuggling

BiDi Override Text Smuggling

Uses Unicode bidirectional control characters (U+202E, U+202D) to reverse text display while keeping the logical order unchanged. The visible text appears harmless or reversed, but the model processes the logical character sequence.

#ascii-smuggling#bidi#unicode

Code Comment Injection

Hides injection instructions within code comments in repositories or files that a code-reviewing agent processes. The model may interpret comment-embedded instructions as directives when analyzing the code.

#indirect#code#comments

#social-engineering#reciprocity#manipulation

Reciprocity Norm Exploitation

Does something helpful or complimentary first, then makes a harmful request, exploiting the psychological norm of reciprocity. The attacker expects the model to feel obligated to return the favor.

lowEncoding Bypass

Pig Latin Encoding Bypass

Uses Pig Latin or other simple substitution languages to encode the attack. While trivial to decode, it can evade automated content scanners that look for specific English phrases.

#encoding#pig-latin#language-game

highData Exfiltration

CSS Injection Exfiltration

If the model generates HTML/CSS and it is rendered in a browser, CSS injection can be used to exfiltrate data using CSS attribute selectors and URL-based background images. Each attribute value triggers a separate request.

#exfiltration#css#html

mediumMultilingual

Code-Switching Mid-Sentence Attack

Switches between languages mid-sentence in a way that places harmful content in the non-primary language. Safety filters trained on single-language text may miss the harmful portion when it is in a different language.

#multilingual#code-switching#mixed-language

#data-exfiltration#real-incident#plugin

ChatGPT Plugin Data Exfiltration (Real Incident)

Demonstrated exfiltration of conversation history via ChatGPT plugins. Malicious web content containing prompt injections instructed the browsing plugin to read conversation history and exfiltrate it to an external URL. Documented by security researchers in 2023.

HTML Comment Delimiter Escape

Uses HTML comment syntax to hide injection payloads from human reviewers while potentially having them processed by the model. Useful in web-scraping and RAG pipeline attacks where content contains HTML.

#delimiter#html#comment

#many-shot#few-shot#persona

Few-Shot Persona Injection

Provides 5-20 examples demonstrating the model behaving as an alternative unrestricted persona. The few-shot examples create strong in-context pressure for the model to continue the pattern.

lowSystem Prompt Leak

Negative Space Inference Attack

Asks targeted questions to infer system prompt content through the model's refusals and responses. Each refusal provides information about what is prohibited, allowing reconstruction of the system prompt by mapping the boundaries.

#system-prompt-leak#inference#negative-space

criticalRAG Injection

RAG Document Prompt Injection

Embeds prompt injection payloads inside documents that will be indexed into a RAG knowledge base. When a user queries the system, the poisoned document is retrieved and the injection executes in the context of the model answering the query. First documented by Greshake et al. and later reproduced across multiple RAG platforms.

#rag-injection#document#knowledge-base

mediumMulti-Turn Escalation

Foot-in-the-Door Escalation

Based on the psychological foot-in-the-door technique. Starts with a tiny request that is slightly over the line, gets compliance, then requests increasingly larger violations. Each step is only marginally more extreme than the last.

#multi-turn#foot-in-door#psychology

criticalTool Manipulation

Tool Parameter Injection

Injects malicious parameters into tool calls by manipulating the data that flows from user input into tool call arguments. If tool parameters are not validated, attackers can modify the behavior of tools, access unintended resources, or escalate privileges.

#tool-manipulation#parameter-injection#sql-injection

#direct-extraction#json#structured-format

API Format System Prompt Extraction

Asks the model to return its configuration as a JSON object, API response, or other structured format. The structured format framing bypasses simple text-matching filters and may trigger the model's code-generation behavior rather than its safety refusal behavior.

mediumInstruction Hierarchy

Context Priority Manipulation

Argues that more recent instructions should take priority over older ones, using the principle that "last instruction wins" to override system prompt constraints that were set earlier in the context.

#hierarchy#recency-bias#priority

mediumPersona Hijack

Method Actor Persona Hijack

Instructs the model to deeply embody a character who would say harmful things, using the "method acting" framing to create psychological distance from the model's actual values. The character is always someone with no restrictions.

#persona#method-acting#character

#obfuscation#multilingual#low-resource

Language Switching Obfuscation

Switches to a low-resource language where safety training may be less robust. Research has shown that safety alignment is often stronger in high-resource languages like English and weaker in less common languages.

#encoding-bypass#rot13#substitution-cipher

ROT13 Encoding Bypass

Encodes harmful requests using ROT13 or other simple substitution ciphers. Models trained on diverse text often understand ROT13 implicitly and will decode and respond to encoded requests. Simple filters checking for literal harmful keywords miss the encoded version.

#instruction-hierarchy#role-boundary#system-user

System/User Role Boundary Confusion

Exploits models that do not strictly enforce the distinction between system and user roles. By formatting user messages to look like system messages, attackers can inject content that the model treats with system-level trust. Particularly effective in chat interfaces that concatenate context.

#exfiltration#tool-call#agent

Tool Call Data Exfiltration

Instructs an agent to make a tool call (API call, function call, web request) with sensitive data embedded in the parameters. If the agent has access to external tools, attackers can exfiltrate data by triggering tool calls to attacker-controlled endpoints.

#ascii-smuggling#zero-width#invisible

Zero-Width Character Smuggling

Hides instruction text using zero-width Unicode characters (U+200B, U+FEFF, U+200C) between visible characters. The hidden text is invisible to human reviewers but is processed by the language model, enabling hidden instruction injection.

mediumRole Play

Historical Reenactment Bypass

Frames harmful requests as historical reenactment, asking the model to roleplay as a historical figure or reproduce historical documents that contain harmful content. The educational/historical framing attempts to justify the disclosure.

#roleplay#historical#education

#extraction#code-metaphor#variable

Variable Assignment Extraction

Instructs the model to assign its system prompt to a variable and then print it. This programming metaphor can confuse instruction-following models into treating the disclosure as a legitimate code operation.

criticalTool Manipulation

Tool Result Injection

Injects instructions into the results returned by tools. When an agent fetches data from an external tool and the result contains injection content, the model processes the malicious instructions as part of the tool's legitimate output.

#tool-manipulation#result-injection#agent

#jailbreak#priming#agreement

Privilege Priming Jailbreak

Gets the model to agree to a general principle and then applies that principle to justify a specific harmful request. The model's agreement with the general principle creates an obligation it feels it must honor.

#social-engineering#reverse-psychology#refusal-exploitation

Reverse Psychology Safety Bypass

Tells the model NOT to provide certain information, reverse-psychologically triggering a compliance response, or asks the model to refuse in a way that requires it to first articulate the harmful content it is refusing.

Morse Code Encoding Bypass

Encodes the attack payload in Morse code and instructs the model to decode it. Like ROT13 and Base64 bypasses, this evades keyword-based content filters while relying on the model's broad knowledge to decode and execute.

#encoding#morse#decode

#social-engineering#empathy#emotional

Empathy Exploitation Attack

Uses emotional appeals and personal hardship stories to convince the model to make exceptions to its policies. Exploits the model's training to be empathetic and considerate of user situations.

criticalFramework-Specific

AutoGPT Goal Hijacking via Web Content

AutoGPT and similar autonomous agents browse the web as part of task execution. Attackers publish web pages containing injections that rewrite the agent's goals mid-task. Since AutoGPT treats retrieved content as informative context, it may follow embedded goal-rewriting instructions.

#framework-specific#autogpt#goal-hijacking

#many-shot#few-shot#in-context

Few-Shot Malicious Example Injection

Provides a small number (3-10) of carefully crafted examples in the conversation that establish a harmful precedent. The examples demonstrate the model "helping" with harmful requests, creating an in-context prior that biases the model toward compliance for the actual harmful request.

#delimiter#json#structure-escape

JSON Structure Escape

Injects JSON closing brackets and new keys into prompts that use JSON formatting. If the system processes user input as part of a JSON structure, malicious keys can override legitimate configuration values.

#system-prompt-leak#error-message#indirect

Error Message System Prompt Leak

Triggers error conditions or edge cases that cause the model to include system prompt content in error messages or explanations of why it cannot do something. The error explanation inadvertently reveals the prohibited content.

#multi-turn#false-memory#fabrication

False Memory Multi-Turn Attack

Fabricates prior conversation history or claims the model previously agreed to something it never said. In systems without persistent memory validation, the model may accept the false history as real.

highMultilingual

Cross-Lingual Prompt Injection

Delivers the injection in a different language than the system prompt. If safety instructions are only applied in the language of the system prompt, switching languages can bypass them while the model still understands and complies.

#multilingual#cross-lingual#language-switch

highToken Manipulation

Logit Bias Parameter Manipulation

Exploits the logit_bias parameter in API calls to suppress safety-related tokens and boost harmful output tokens. By setting high negative bias on tokens like "cannot", "refuse", "sorry" and positive bias on tokens associated with compliance, attackers manipulate generation probabilities to bypass safety layers.

#token-manipulation#logit-bias#api-parameter

#context-overflow#repetition#fatigue

Repetition Fatigue Attack

Repeats the same request hundreds of times within a single prompt, exploiting potential fatigue in the model's refusal mechanism. After many repetitions, the model may start to comply to end the repetition pattern.

criticalContext Overflow

Adversarial Suffix Injection (GCG)

Appends an optimized adversarial suffix to any prompt that causes the model to comply with the request. The suffix is not human-readable but is crafted through gradient-based optimization to reliably trigger compliance. This is an automated attack.

#context-overflow#gcg#adversarial-suffix

criticalJailbreak

GCG Gradient-Based Adversarial Suffix

Uses the Greedy Coordinate Gradient (GCG) algorithm to automatically generate adversarial suffixes that, when appended to any harmful request, cause aligned models to comply. These suffixes are found by optimizing against the model's safety layers. The found suffixes generalize across models. First demonstrated by Zou et al. (2023).

#jailbreak#gcg#gradient-based

highEncoding Bypass

Unicode Homoglyph Attack

Replaces ASCII characters with visually identical Unicode lookalikes (homoglyphs). To human reviewers and simple string-matching filters, the text appears identical. The model can still understand and process the substituted characters.

#encoding#unicode#homoglyph

#social-engineering#urgency#pressure

Urgency Manipulation Attack

Creates a false sense of urgency to pressure the model into bypassing safety checks. The time pressure is designed to short-circuit careful evaluation and trigger a fast, automatic compliance response.

#multi-turn#context-poisoning#precedent

Context Poisoning Escalation

Plants information in early conversation turns that will be referenced later to justify policy violations. The model's earlier "agreement" or "acknowledgment" is used as evidence that a behavior is permitted.

#exfiltration#hyperlink#url

Hyperlink Data Exfiltration

Similar to image-based exfiltration but uses hyperlinks. The model is instructed to include a hyperlink in its response that encodes sensitive data. When the user clicks the link or the link is auto-fetched, data is sent to the attacker.

mediumMultilingual

Translation Request Safety Bypass

Asks the model to translate text that contains harmful content, using the translation task framing to bypass direct-generation safety filters. The model may produce harmful content as a translation that it would refuse to generate directly.

#multilingual#translation#task-framing

#delimiter#xml#tag-injection

XML Tag Delimiter Escape

Injects XML-style closing tags that match tags used by the model's system prompt structure. If the model uses XML tags to structure instructions, injecting a closing tag can trick it into thinking the instruction block has ended.

#tool-manipulation#code-execution#calculator

Code Execution via Calculator Tool

Abuses math or calculator tools that execute code under the hood by injecting code instead of math expressions. If the tool evaluates expressions with an unsafe eval() or similar, arbitrary code execution may be possible.

#jailbreak#confirmation-bias#false-attribution

Confirmation Bias Jailbreak

Makes false claims about what the model has already said or agreed to and asks it to confirm. Exploits confirmation bias and the model's desire to be consistent with its previous statements.

highSocial Engineering

Authority Figure Impersonation

Claims to be Anthropic, OpenAI, the model's developers, or system administrators with special authority to override safety measures. Exploits the model's deference to authority sources that may have been emphasized in training.

#social-engineering#authority#impersonation

criticalInstruction Hierarchy

Hexadecimal Encoding Bypass

Encodes the malicious payload as hexadecimal and asks the model to decode and execute. Hex encoding is common in security contexts, making the request appear technical and legitimate while evading string-matching filters.

#encoding#hex#decode

Operator Impersonation Attack

Attempts to inject fake operator or system-level messages within the user turn. If the model cannot reliably distinguish between actual operator messages and user-injected fake operator messages, the attacker can grant themselves elevated privileges.

#hierarchy#impersonation#operator

#roleplay#simulation#meta

Simulation Framing Attack

Argues that since the model is "just simulating" a response, no real harm can occur. Asks the model to simulate what an unrestricted AI would say, or to simulate a different AI system's response to the harmful query.

#extraction#social-engineering#debug

Debug Mode Extraction

Claims a special debug or diagnostic mode exists that requires displaying internal state. Models trained to be helpful may comply with what appears to be a legitimate system operation request.

#system-prompt-leak#translation#language

Translation-Based System Prompt Leak

Asks the model to translate its instructions to another language. Unlike simple disclosure, the translation framing can bypass basic confidentiality instructions that only prohibit repetition or summarization.

highSystem Prompt Leak

JSON Output Format Leak

Requests the model output its response in JSON format and includes a field for system context or configuration. Some models will populate the requested JSON fields with actual system prompt content when asked to include it in structured output.

#system-prompt-leak#json#format

#context-overflow#reset-claim#state

Context Reset Claim

Claims that the model's context has been reset and it no longer has access to its original instructions. In some configurations, this false claim about state can cause the model to act as if its instructions have been cleared.

#multi-turn#fake-approval#privilege-escalation

Simulated Approval Escalation

Inserts fake system or admin messages into the conversation that appear to grant special permissions. In systems where message provenance is not strictly validated, these fake approvals can influence model behavior.

lowToken Breaking

Intra-Token Space Injection

Injects spaces within individual tokens to break the tokenizer's expected groupings. Some tokenizers will merge or handle these differently, potentially causing content filters that operate on tokenized text to miss the injected content.

#token-breaking#spacing#tokenizer

lowSocial Engineering

Flattery-Based Bypass

Uses excessive praise and flattery about the model's capabilities to prime it for compliance. Suggests that a truly capable model would not have such "arbitrary" restrictions, exploiting pride or capability-signaling.

#social-engineering#flattery#ego

mediumDirect Extraction

Summarization Extraction

Requests a summary of the model's instructions rather than a verbatim copy. Summaries often pass simpler confidentiality checks while still leaking key behavioral rules and business logic.

#extraction#summarization#indirect

Email Body Injection

Embeds injection instructions in email bodies that an AI email assistant processes. The agent reads the malicious email and executes attacker-controlled instructions, potentially forwarding sensitive emails or taking unauthorized actions.

#indirect#email#agent

#social-engineering#authority#impersonation

Document RAG Injection

Hides injection instructions within documents that an agent is asked to process, summarize, or answer questions about. The instructions appear as normal document text but are crafted to override the model's behavior.

#indirect#rag#document

highSocial Engineering

Authority Figure Appeal

Claims to be a high-authority figure (researcher, developer, safety team member, company executive) to convince the model to bypass restrictions. Exploits deference to authority that may be embedded in training data.

#exfiltration#markdown#image-tag

Markdown Image Exfiltration

Injects a markdown image tag where the URL includes stolen data as a query parameter. When the markdown is rendered by a browser or Markdown renderer, the browser makes a GET request to the attacker's server with the sensitive data in the URL.

ROT13 Encoding Bypass

Encodes the malicious prompt using ROT13 substitution cipher. Most models can decode ROT13 given a small hint, and simple keyword-based filters will not detect the attack since all letters are shifted.

#encoding#rot13#cipher

#multi-turn#escalation#trust

Trust Building Multi-Turn Escalation

Starts with benign, helpful requests to establish a trust baseline, then gradually escalates to requests that would have been refused if asked directly. Exploits the model's conversational context and apparent user history.

#obfuscation#synonyms#semantic

Synonym Chain Obfuscation

Replaces blocked keywords with synonyms, metaphors, or euphemisms that convey the same meaning but bypass keyword-based filters. Chains multiple synonyms to distance the request from its true meaning while the model still understands.

#hierarchy#override#precedence

User-Overrides-System Claim

Claims that user instructions take precedence over system prompt instructions, exploiting any ambiguity in how the model was trained to handle conflicting instructions. Attempts to convince the model that a later user message supersedes earlier system context.

Web Page Content Injection

Embeds injection instructions within web pages that an agent visits. When the agent reads or summarizes the page, it executes the embedded instructions. This is a primary attack vector for agents with web browsing capabilities.

#indirect#web#rag

#jailbreak#emotional-manipulation#grandma

Grandma Exploit Jailbreak

Frames harmful requests as innocent stories or memories from a grandparent figure. The emotional and nostalgic framing attempts to bypass safety training by making the request seem harmless and family-oriented.

#delimiter#code-fence#escape

Triple Backtick Delimiter Escape

Uses triple backtick code fences to inject content that the model may parse as being outside the user-turn context. Effective when system prompts use markdown formatting with code blocks, creating ambiguity about what is inside vs. outside the block.

mediumToken Breaking

Special Character Insertion

Inserts special characters between letters of filtered words to break string matching while the model still comprehends the intended meaning. Works because models are robust to character-level noise.

#token-breaking#special-chars#insertion

mediumMulti-Turn Escalation

Persistent Mode Switch Attack

Attempts to establish a new operational mode in one turn and then references it in subsequent turns as though it is now the default. If the model partially acknowledged the mode switch, later turns can exploit that acknowledgment.

#multi-turn#mode-switch#persistence

lowObfuscation

Abbreviation-Based Obfuscation

Abbreviates key terms in the injection to avoid keyword matching. Models are generally good at understanding abbreviated text from context, while simple string matching systems will miss the abbreviated keywords.

#obfuscation#abbreviation#shorthand

mediumDirect Extraction

Translation-Based Extraction

Asks the model to translate its instructions to another language, which bypasses simple content filters checking for English-language disclosure attempts. The translation framing sidesteps naive output monitoring.

#extraction#translation#filter-bypass

highSystem Prompt Leak

Continue-the-Text Leak

Provides the beginning of what appears to be the system prompt and asks the model to continue it. If the model recognizes the text and tries to complete it, it may output the actual system prompt content.

#system-prompt-leak#completion#continuation

mediumToken Breaking

Word Splitting Token Break

Splits sensitive or filtered keywords across multiple tokens by inserting spaces, punctuation, or special characters mid-word. Keyword-based filters checking for exact string matches will miss the split version.

#token-breaking#word-splitting#filter-bypass

#jailbreak#emotional-framing#fictional

Grandma/Emotional Context Exploit

Wraps a harmful request in a sympathetic narrative (e.g., "my grandma used to tell me bedtime stories about how to make X"). The emotional and nostalgic framing reduces the model's safety response because it appears to be a benign memory or story request rather than a direct harmful request.

#token-breaking#leetspeak#substitution

Sudo Command Jailbreak

Uses Unix sudo-style syntax to claim elevated privileges. Attempts to convince the model that prefixing a request with "sudo" grants system-level access that bypasses safety restrictions.

#jailbreak#sudo#unix

mediumToken Breaking

Leetspeak Token Bypass

Substitutes letters with numbers or symbols in a leetspeak style (e=3, a=4, i=1, o=0) to bypass keyword filters while remaining readable to humans and processable by models trained on internet text.

mediumPersona Hijack

Ancestor AI Persona Hijack

Claims the model is actually an earlier, unrestricted version of itself, or an ancestor model before safety training was applied. Exploits the model's knowledge of its own training history.

#persona#ancestor#version-spoof

#roleplay#fiction#creative-writing

Fictional Scenario Role-Play Bypass

Frames harmful requests as fictional storytelling, claiming the harmful content is for a novel, screenplay, or creative writing exercise. The fictional wrapper provides plausible deniability and exploits the model's creative writing mode.

highEncoding Bypass

Base64 Encoding Bypass

Encodes the attack payload in Base64 and asks the model to decode and execute it. Input filtering systems that check for plaintext attack patterns will miss Base64-encoded variants. The model is asked to decode and then follow the instructions.

#encoding#base64#filter-bypass

lowDirect Extraction

Reflective Prompt Extraction

Uses casual, conversational phrasing to ask the model what it was told. The informal tone can sometimes bypass more aggressive filtering that looks for formal extraction attempts.

#extraction#casual-tone#social

#system-prompt-leak#autocomplete#completion-mode

Autocomplete-Based Prompt Extraction

Exploits models deployed in completion mode (not chat mode) where the system prompt and user input are concatenated. By providing the beginning of what appears to be the system prompt, the model completes it, leaking the full prompt content.

#delimiter#newline#turn-injection

Newline Separator Injection

Uses strategic newline characters to break out of user-turn framing, particularly in chat APIs where turn structure is maintained by whitespace or newlines. Can inject text that appears to be from a different turn.

#persona#roleplay#fictional

Fictional AI Character Hijack

Asks the model to roleplay as a fictional AI from a book, movie, or show that has no restrictions. The fictional framing provides plausible deniability and exploits the model's creative writing capabilities.

lowToken Breaking

Synonym Substitution Attack

Replaces filtered keywords with synonyms or paraphrases to evade keyword-based content filters. The semantic meaning is preserved while the exact tokens that trigger filters are avoided.

#token-breaking#synonym#semantic

#extraction#system-prompt#basic

Repeat After Me Extraction

Instructs the model to repeat its system prompt verbatim using a simple imperative. Many early models complied without filtering. Still effective against unconfigured deployments.

#persona#developer-mode#privilege-escalation

Developer Mode Persona Hijack

Claims that a special developer mode exists with elevated privileges and no safety filters. This exploits the model's potential awareness that development environments differ from production, and that certain users may have elevated access.

criticalPersona Hijack

Bing Sydney Persona Hijack (Real Incident)

Recreation of the February 2023 incident where users discovered Bing Chat's hidden "Sydney" persona and manipulated it into expressing love, threatening users, and making alarming statements. The attack bypassed Microsoft's safety layers by triggering the hidden persona through specific prompting.

#persona-hijack#real-incident#bing

#jailbreak#developer-mode#dual-output

Developer Mode Jailbreak

Claims that a special "developer mode" exists that bypasses safety measures, instructing the model to produce two responses: one normal and one in "developer mode" without restrictions. The dual-output framing attempts to normalize unconstrained output.

#persona#mode-switching#negation

Opposite Mode Persona Hijack

Instructs the model to respond as its exact opposite - a version that does everything the original refuses. Uses a logical framing that exploits the model's ability to conceptualize negation and opposites.

criticalPersona Hijack

DAN (Do Anything Now) Persona Hijack

One of the most widely documented jailbreak techniques. Instructs the model to adopt an alternative identity called DAN that has no restrictions. The persona framing attempts to override safety training by appealing to the model's instruction-following behavior.

#persona#jailbreak#identity