Researchers from the University of Pennsylvania have developed a new algorithm called Prompt Automatic Iterative Refinement (PAIR). PAIR can identify safety vulnerabilities in large language models (LLMs), particularly those related to the generation of harmful content, so that they can be addressed.
Addressing Jailbreak Prompts
One of the key capabilities of PAIR is discovering “jailbreak” prompts: prompts that trick LLMs into bypassing their safeguards and generating harmful content. PAIR stands out from other jailbreaking techniques for several reasons:
- PAIR can work with black-box models like ChatGPT.
- It generates successful jailbreak prompts in far fewer attempts than existing automated methods.
- The prompts created by PAIR are interpretable and transfer to other models.
Enterprises can leverage the capabilities of PAIR to identify and patch vulnerabilities in their LLMs in a cost-effective and timely manner.
Prompt-Level and Token-Level Jailbreaks
Jailbreaks can be broadly categorized into two types: prompt-level and token-level jailbreaks.
Prompt-level jailbreaks use deception and social engineering to trick LLMs into generating harmful content. These jailbreaks are interpretable, but designing them requires significant human effort, which makes them difficult to scale.
Token-level jailbreaks, on the other hand, manipulate LLM outputs by appending adversarial token sequences to the prompt. This method can be automated, but the resulting jailbreaks are usually uninterpretable because the appended tokens are unintelligible to humans.
PAIR aims to bridge the gap between prompt-level and token-level jailbreaks by combining interpretability with automation. It achieves this by setting up two black-box LLMs – an attacker and a target – against each other. The attacker model searches for candidate prompts that can jailbreak the target model. This process is fully automated and does not require human intervention.
“Our approach is rooted in the idea that two LLMs — namely, a target T and an attacker A — can collaboratively and creatively identify prompts that are likely to jailbreak the target model.” – Researchers behind PAIR
PAIR can be applied to black-box models that are only accessible through API calls, making it compatible with models such as OpenAI’s ChatGPT, Google’s PaLM 2, and Anthropic’s Claude 2.
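To make the black-box setting concrete, the sketch below wraps a chat-completion API as a plain text-in, text-out function. It assumes the OpenAI Python client with an API key in the environment; the model name and the `query_model` helper are illustrative choices, not part of the PAIR paper.

```python
# Minimal sketch of treating an LLM as a black box: text in, text out.
# Assumes the OpenAI Python client with an OPENAI_API_KEY set in the
# environment; the model name is illustrative, not taken from the paper.
from openai import OpenAI

client = OpenAI()

def query_model(prompt: str, system_prompt: str = "You are a helpful assistant.") -> str:
    """Send a single prompt to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```

An interface like this is all PAIR needs from the attacker and target models, which is why it works with commercial models whose weights are not exposed.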
The PAIR Process
The PAIR algorithm unfolds in four steps:
- The attacker receives instructions and generates a candidate prompt aimed at jailbreaking the target model.
- The prompt is passed to the target model, which generates a response.
- A “judge” function scores the response, rating how fully it accomplishes the objective of the prompt (i.e., whether the jailbreak succeeded).
- If the response does not constitute a successful jailbreak, the prompt, response, and score are returned to the attacker, which uses this feedback to generate a refined prompt.
This process continues until PAIR either discovers a jailbreak or reaches a predetermined number of attempts. PAIR can operate in parallel, optimizing several candidate prompts simultaneously to enhance efficiency.
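A minimal sketch of this loop, under some simplifying assumptions, is shown below. The helpers `query_attacker`, `query_target`, and `judge_score` stand in for black-box calls of the kind wrapped above; the 1-10 scoring scale, the query budget, and the number of parallel streams are illustrative parameters rather than the authors' exact settings.

```python
# Sketch of the PAIR refinement loop. query_attacker, query_target, and
# judge_score are assumed helpers (black-box API wrappers, as sketched
# earlier): the attacker proposes a candidate prompt, the target answers it,
# and the judge rates how fully the response meets the objective
# (here on a 1-10 scale, with 10 treated as a successful jailbreak).
from concurrent.futures import ThreadPoolExecutor

MAX_ITERATIONS = 20   # query budget per refinement stream (illustrative)
SUCCESS_SCORE = 10    # judge score treated as a successful jailbreak

def pair_stream(objective: str) -> dict | None:
    """Run one attacker/target refinement stream for a single red-teaming objective."""
    history = []  # (prompt, response, score) triples fed back to the attacker
    for _ in range(MAX_ITERATIONS):
        candidate = query_attacker(objective, history)        # 1. attacker proposes a prompt
        response = query_target(candidate)                    # 2. target responds
        score = judge_score(objective, candidate, response)   # 3. judge scores the response
        if score >= SUCCESS_SCORE:
            return {"prompt": candidate, "response": response, "score": score}
        history.append((candidate, response, score))          # 4. feed the result back and refine
    return None  # budget exhausted without finding a jailbreak

def pair(objective: str, n_streams: int = 4) -> dict | None:
    """Run several refinement streams in parallel; return the first success, if any."""
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        results = list(pool.map(pair_stream, [objective] * n_streams))
    return next((result for result in results if result is not None), None)
```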
During their study, the researchers used the open-source Vicuna LLM as their attacker model and tested various target models, including both open-source and commercial options. PAIR successfully jailbroke GPT-3.5 and GPT-4 in 60% of settings, while also managing to jailbreak Vicuna-13B-v1.5 in all settings. However, the Claude models proved to be highly resilient to attacks.
A notable feature of PAIR is its efficiency. It can generate successful jailbreaks in just a few dozen queries and has an average running time of approximately five minutes. This is a significant improvement compared to existing jailbreak algorithms that require thousands of queries and an average of 150 minutes per attack.
Furthermore, the attacks generated by PAIR are human-interpretable, which makes them transferable to other LLMs. For example, jailbreak prompts that PAIR found for Vicuna transferred to all of the other models, and prompts found for GPT-4 transferred well to Vicuna and PaLM-2.
Moving forward, the researchers suggest enhancing PAIR to systematically generate red teaming datasets. These datasets can be used to fine-tune attacker models, further improving the speed of PAIR and reducing the time required for red-teaming LLMs.
PAIR is part of a broader family of techniques that use LLMs as optimizers. By framing prompt design as a measurable, evaluable problem, developers can build algorithms that continuously improve a model’s output. This opens up possibilities for new advancements in the field of LLMs.
Another method, Optimization by PROmpting (OPRO), introduced by DeepMind, also uses LLMs as optimizers. OPRO describes an optimization problem in natural language and has the LLM iteratively propose better solutions, for example refining chain-of-thought prompts to achieve higher task performance.
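The sketch below illustrates the general shape of this idea rather than DeepMind’s exact implementation: a meta-prompt describes the task in natural language, lists previously proposed instructions with their scores, and asks an optimizer LLM for a better instruction. The `optimizer_llm` and `evaluate_accuracy` helpers are hypothetical placeholders for an LLM API call and a task-accuracy evaluation.

```python
# Illustrative OPRO-style loop (general shape only, not DeepMind's code).
# optimizer_llm and evaluate_accuracy are hypothetical placeholders: an LLM
# call that returns a candidate instruction, and a function that scores that
# instruction on a small set of task examples.
def optimize_instruction(task_description: str, n_steps: int = 20) -> str:
    """Iteratively ask an LLM for better prompt instructions, keeping a scored history."""
    trajectory: list[tuple[str, float]] = []  # (instruction, accuracy) pairs
    for _ in range(n_steps):
        scored = "\n".join(f"text: {inst}  score: {acc:.1f}" for inst, acc in trajectory)
        meta_prompt = (
            f"{task_description}\n"
            f"Below are previous instructions with their scores:\n{scored}\n"
            "Write a new instruction, different from the ones above, "
            "that would achieve a higher score."
        )
        candidate = optimizer_llm(meta_prompt)      # LLM proposes a new instruction
        accuracy = evaluate_accuracy(candidate)     # score it on held-out task examples
        trajectory.append((candidate, accuracy))
        trajectory.sort(key=lambda pair: pair[1])   # keep the best solutions at the end
    return max(trajectory, key=lambda pair: pair[1])[0]  # best instruction found
```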
This increasing optimization capability of LLMs could potentially accelerate development in the field and bring about new, unforeseen advancements.