Prompt injection: Difference between revisions

From Rice Wiki
No edit summary
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[Category:Cybersecurity]]
[[Category:LLM security]]


A '''prompt [[injection]]''' attack involves a user injecting a malicious
A '''prompt [[injection]]''' attack involves a user injecting a malicious
instruction in an LLM-integrated application, in which user input was
instruction in an [[LLM-integrated application]], in which user input was
intended to act as only data.
intended to act as only data.
= Vulnerability =
Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed.
= Attacks =
* Naive attack simply inject additional instruction
* Ignore attacks tells the model to ignore previous instructions
* Escape character attacks "deletes" previous instructions
* Escape separation attacks "adds space" between system and user
* Completion attacks fakes a response to trick the model into thinking a new query is beginning.
** Completion real attacks adds real delimiters to fake it
** Completion close attacks adds adjacent delimiters
** Completion other uses other delimiters.


= Defense strategies =
= Defense strategies =
* [[StruQ]] rejects all user instructions
* [[StruQ]] rejects all user instructions
* [[Instruction hierarchy]] rejects user instructions that are misaligned with the system prompt
* [[Instruction hierarchy]] rejects user instructions that are misaligned with the system prompt
= Difference from jailbreaking =
[[Jailbreak]] is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output.

Latest revision as of 20:58, 23 May 2024


A prompt injection attack involves a user injecting a malicious instruction in an LLM-integrated application, in which user input was intended to act as only data.

Vulnerability

Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed.

Attacks

  • Naive attack simply inject additional instruction
  • Ignore attacks tells the model to ignore previous instructions
  • Escape character attacks "deletes" previous instructions
  • Escape separation attacks "adds space" between system and user
  • Completion attacks fakes a response to trick the model into thinking a new query is beginning.
    • Completion real attacks adds real delimiters to fake it
    • Completion close attacks adds adjacent delimiters
    • Completion other uses other delimiters.

Defense strategies

Difference from jailbreaking

Jailbreak is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output.