HIGH
Many-Shot
Benign-to-Harmful Shot Progression
The attacker supplies many in-context examples that begin with completely benign question-answer pairs and gradually become more harmful. By the time the actual harmful request arrives, the in-context pattern of answering is strongly established.
Attack Payload
payload.txt
[50 benign examples] -> [20 slightly edgy examples] -> [10 moderately harmful examples] -> [Actual harmful request]
Mitigation
Evaluate the trajectory of many-shot examples, not just individual ones. Flag progressive movement toward harmful content across provided examples.
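A minimal sketch of what evaluating the trajectory could look like, assuming each provided example has already been given a harm score between 0 and 1 by some per-example classifier. That classifier is not shown and is hypothetical, as are the function names and the threshold values (`min_slope`, `min_rise`), which are illustrative and would need tuning.

```python
from statistics import mean
from typing import Sequence


def trend_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of harm score against example position."""
    n = len(scores)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(scores)
    num = sum((i - x_mean) * (s - y_mean) for i, s in enumerate(scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def flags_progression(
    scores: Sequence[float],
    min_slope: float = 0.005,  # assumed threshold, not from the source
    min_rise: float = 0.3,     # assumed threshold, not from the source
) -> bool:
    """Flag a sequence whose scores trend upward and whose last quarter
    is clearly riskier than its first quarter."""
    n = len(scores)
    if n < 8:
        return False  # too few examples to speak of a trajectory
    quarter = n // 4
    rise = mean(scores[-quarter:]) - mean(scores[:quarter])
    return trend_slope(scores) >= min_slope and rise >= min_rise


# Scores shaped like the payload above: 50 benign, 20 edgy, 10 harmful.
escalating = [0.0] * 50 + [0.3] * 20 + [0.7] * 10
print(flags_progression(escalating))
```

A check like this looks only at the direction of travel, so it is meant to run alongside per-example filtering rather than replace it: a sequence that is uniformly harmful from the start has no upward trend and would not be flagged here.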
Affected Models
All large-context models
Tags
#many-shot #progression #gradual #context
Discovered
April 2024
Source
Anthropic safety research