Prompt injection: Difference between revisions
From Rice Wiki
Line 14: | Line 14: | ||
* Escape character attacks "deletes" previous instructions | * Escape character attacks "deletes" previous instructions | ||
* Escape separation attacks "adds space" between system and user | * Escape separation attacks "adds space" between system and user | ||
* Completion attacks fakes a response to trick the model into thinking a new query is beginning. | |||
= Defense strategies = | = Defense strategies = |
Revision as of 20:57, 23 May 2024
A prompt injection attack involves a user injecting a malicious
instruction in an LLM-integrated application, in which user input was
intended to act as only data.
Vulnerability
Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed.
Attacks
- Naive attack simply inject additional instruction
- Ignore attacks tells the model to ignore previous instructions
- Escape character attacks "deletes" previous instructions
- Escape separation attacks "adds space" between system and user
- Completion attacks fakes a response to trick the model into thinking a new query is beginning.
Defense strategies
- StruQ rejects all user instructions
- Instruction hierarchy rejects user instructions that are misaligned with the system prompt
Difference from jailbreaking
Jailbreak is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output.