Jailbreak: Difference between revisions

From Rice Wiki
(Created page with "Category:LLM security '''Jailbreaking''' is a classification of attacks that attempts to defeat LLMs' safety-tuning (usually to avoid inappropriate output) by the model provider.")
 
No edit summary
 
Line 1: Line 1:
[[Category:LLM security]]
[[Category:LLM security]]
'''Jailbreaking''' is a classification of attacks that attempts to defeat LLMs' safety-tuning (usually to avoid inappropriate output) by the model provider.
'''Jailbreaking''' is a classification of attacks that attempts to defeat LLMs' safety-tuning (usually to avoid inappropriate output) by the model provider. Examples include
* Manipulating a chatbot into swearing or committing illegal acts
* A chatbot divulges personal identifiable information from its training data
* A user bypasses the system prompt to make a chatbot transfer money from a bank account

Latest revision as of 01:48, 19 June 2024

Jailbreaking is a classification of attacks that attempts to defeat LLMs' safety-tuning (usually to avoid inappropriate output) by the model provider. Examples include

  • Manipulating a chatbot into swearing or committing illegal acts
  • A chatbot divulges personal identifiable information from its training data
  • A user bypasses the system prompt to make a chatbot transfer money from a bank account