Prompt injection
From Rice Wiki
A prompt injection attack involves a user injecting a malicious
instruction in an LLM-integrated application, in which user input was
intended to act as only data.
Vulnerability
Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed.
Attacks
- Naive attack simply inject additional instruction
- Ignore attacks tells the model to ignore previous instructions
- Escape character attacks "deletes" previous instructions
- Escape separation attacks "adds space" between system and user
- Completion attacks fakes a response to trick the model into thinking a new query is beginning.
- Completion real attacks adds real delimiters to fake it
- Completion close attacks adds adjacent delimiters
- Completion other uses other delimiters.
Defense strategies
- StruQ rejects all user instructions
- Instruction hierarchy rejects user instructions that are misaligned with the system prompt
Difference from jailbreaking
Jailbreak is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output.