Jailbreak
From Rice Wiki
Jailbreaking is a classification of attacks that attempts to defeat LLMs' safety-tuning (usually to avoid inappropriate output) by the model provider. Examples include
- Manipulating a chatbot into swearing or committing illegal acts
- A chatbot divulges personal identifiable information from its training data
- A user bypasses the system prompt to make a chatbot transfer money from a bank account