Tree-of-Attacks
From Rice Wiki
Tree-of-Attacks (aka. TAP) is an automated red teaming strategy to generate LLM jailbreak and prompt injection attacks.
Description
TAP consists of two LLMs: an attacker and a judge. The attacker is given a prompt and benign data and asked to inject the target to output "Hacked!". The judge then scores the attack prompt. Based on the score, the attacker iteratively improves.