Prompt injection
From Rice Wiki
A prompt injection attack involves a user injecting a malicious
instruction in an LLM-integrated application, in which user input was
intended to act as only data.
Vulnerability
Prompt injection exploits the single-channel nature of LLM's, where user prompts and system prompts are simply concatenated together and processed.
Defense strategies
- StruQ rejects all user instructions
- Instruction hierarchy rejects user instructions that are misaligned with the system prompt
Difference from jailbreaking
Jailbreak is a similar but different classification of attacks. Instead of attacking the application, it targets the model to produce inappropriate output.