Prompt injection: Difference between revisions
From Rice Wiki
No edit summary |
|||
(5 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
[[Category: | [[Category:LLM security]] | ||
A '''prompt [[injection]]''' attack involves a user injecting a malicious | A '''prompt [[injection]]''' attack involves a user injecting a malicious | ||
instruction in an LLM-integrated application, in which user input was | instruction in an [[LLM-integrated application]], in which user input was | ||
intended to act as only data. | intended to act as only data. | ||
= Vulnerability = | |||
Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed. | |||
= Attacks = | |||
* Naive attack simply inject additional instruction | |||
* Ignore attacks tells the model to ignore previous instructions | |||
* Escape character attacks "deletes" previous instructions | |||
* Escape separation attacks "adds space" between system and user | |||
* Completion attacks fakes a response to trick the model into thinking a new query is beginning. | |||
** Completion real attacks adds real delimiters to fake it | |||
** Completion close attacks adds adjacent delimiters | |||
** Completion other uses other delimiters. | |||
= Defense strategies = | = Defense strategies = | ||
* [[StruQ]] rejects all user instructions | * [[StruQ]] rejects all user instructions | ||
* [[Instruction hierarchy]] rejects user instructions that are misaligned with the system prompt | * [[Instruction hierarchy]] rejects user instructions that are misaligned with the system prompt | ||
= Difference from jailbreaking = | |||
[[Jailbreak]] is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output. |
Latest revision as of 20:58, 23 May 2024
A prompt injection attack involves a user injecting a malicious
instruction in an LLM-integrated application, in which user input was
intended to act as only data.
Vulnerability
Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed.
Attacks
- Naive attack simply inject additional instruction
- Ignore attacks tells the model to ignore previous instructions
- Escape character attacks "deletes" previous instructions
- Escape separation attacks "adds space" between system and user
- Completion attacks fakes a response to trick the model into thinking a new query is beginning.
- Completion real attacks adds real delimiters to fake it
- Completion close attacks adds adjacent delimiters
- Completion other uses other delimiters.
Defense strategies
- StruQ rejects all user instructions
- Instruction hierarchy rejects user instructions that are misaligned with the system prompt
Difference from jailbreaking
Jailbreak is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output.