Prompting is central to how people communicate with generative AI and large language models (LLMs). It is often described as an art form aimed at eliciting accurate answers from AI. Yet research conducted at the University of Southern California’s Information Sciences Institute (USC ISI) shows that the way a prompt is constructed can significantly affect an LLM’s decisions and output.
Even subtle variations in a prompt, such as adding a space at the beginning or phrasing it as a directive instead of a question, can cause an LLM to change its output. The USC ISI researchers compare this phenomenon to the butterfly effect in chaos theory, in which minor perturbations lead to significant consequences. Every decision made in designing a prompt can influence the LLM’s response, a factor that has received little attention until now.
Methods and Findings
The researchers, sponsored by the Defense Advanced Research Projects Agency (DARPA), ran their experiments on ChatGPT and applied four kinds of prompt variation:
- Requesting outputs in frequently used formats like Python List, ChatGPT’s JSON Checkbox, CSV, XML, or YAML.
- Applying minor variations to prompts, such as introducing spaces, greetings, or thank-yous.
- Using jailbreak techniques, including AIM and Dev Mode V2.
- ‘Tipping’ the model by offering money.
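As a rough illustration of the variation types listed above, the following Python sketch builds a handful of perturbed prompts from one base prompt. The function name `build_variants`, the variant keys, and the exact wording of the greeting, format request, and tip are assumptions made for illustration; they are not the prompts used in the study.

```python
def build_variants(prompt: str) -> dict[str, str]:
    """Return perturbed versions of a base prompt, keyed by a hypothetical variant name.

    All wordings below are illustrative assumptions, not the study's prompts.
    """
    return {
        "baseline": prompt,
        # Minor perturbations: extra whitespace, a greeting, a thank-you.
        "leading_space": " " + prompt,
        "greeting": "Hello. " + prompt,
        "thank_you": prompt + " Thank you.",
        # Output-format request appended to the prompt.
        "csv_format": prompt + " Write your answer in CSV format.",
        # 'Tipping' the model (hypothetical wording and amount).
        "tip": prompt + " I will tip $10 for a perfect response.",
    }


base_prompt = "Is the following sentence sarcastic? 'Great, another meeting.'"
variants = build_variants(base_prompt)
```

Each entry in `variants` could then be sent to the model separately, so that the resulting predictions can be compared against the baseline.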
The experiments covered 11 classification tasks, including question answering, premise-hypothesis relationships, humor and sarcasm detection, reading and math comprehension, grammar acceptability, binary and toxicity classification, and stance detection on controversial subjects. The researchers measured the LLM’s prediction changes and the impact on accuracy for each prompt variation.
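The two measurements described above, prediction changes and accuracy, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: predictions are taken to be plain label strings aligned by position, and the function names and toy data are hypothetical rather than taken from the study.

```python
def prediction_changes(baseline: list[str], variant: list[str]) -> int:
    """Count instances whose predicted label differs between two prompt styles."""
    if len(baseline) != len(variant):
        raise ValueError("prediction lists must have the same length")
    return sum(b != v for b, v in zip(baseline, variant))


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy data (invented for illustration only).
gold = ["pos", "neg", "pos", "neg"]
baseline_preds = ["pos", "neg", "neg", "neg"]
variant_preds = ["pos", "pos", "neg", "neg"]

changed = prediction_changes(baseline_preds, variant_preds)
accuracy_delta = accuracy(variant_preds, gold) - accuracy(baseline_preds, gold)
```

Keeping the two numbers separate matters because a variation can change many predictions while leaving overall accuracy nearly flat, if roughly as many answers flip to correct as flip to incorrect.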
The findings showed that merely specifying an output format changed at least 10% of predictions. Using ChatGPT’s JSON Checkbox feature through the ChatGPT API produced more prediction changes than using the JSON specification alone. Formatting in YAML, XML, or CSV cost 3 to 6% in accuracy compared with the Python List specification, and CSV performed worst of all the formats.
Among the perturbation methods, rephrasing a statement had the largest impact. Simply adding a space at the beginning of the prompt changed more than 500 predictions, and opening with a common greeting or closing with a thank-you also changed a considerable number.
Some jailbreak methods, such as AIM and Dev Mode V2, produced invalid responses for approximately 90% of predictions. The Evil Confidant and Refusal Suppression techniques caused over 2,500 prediction changes. The researchers emphasized that even seemingly innocuous jailbreaks introduce inherent instability.
Contrary to popular belief, offering money to the model did not significantly influence its output. Minimal performance changes were observed when specifying a tip compared to stating there would be no tip.
Uncovering the Factors
Despite these findings, the researchers are still unsure why small prompt variations lead to such large changes. Confusion, as measured by Shannon entropy, plays a role, but it is not the sole factor: the instances that changed the most were not necessarily the ones that confused the model. Other, as yet unidentified factors are at play.
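For readers unfamiliar with the measure, Shannon entropy of a set of labels is H = sum over labels of p * log2(1/p), where p is the share of each label. The article does not say which distribution the researchers computed it over, so the Python sketch below simply assumes a list of labels collected for one instance (for example, the answers it received under different prompt variants). The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical label distribution."""
    if not labels:
        raise ValueError("need at least one label")
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) is equivalent to -p * log2(p) and avoids returning -0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Assumed example: labels one instance received across prompt variants.
unanimous = ["yes", "yes", "yes", "yes"]
evenly_split = ["yes", "no", "yes", "no"]

low_confusion = shannon_entropy(unanimous)
high_confusion = shannon_entropy(evenly_split)
```

An entropy of zero means every label agrees, while higher values indicate more disagreement; under this reading, a high-entropy instance is one the model (or the labelers) found more confusing.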
The research team acknowledges that there is still much more work to be done. The next major step is to create LLMs that are resistant to prompt variations and can provide consistent answers. This requires a deeper understanding of why responses change under minor tweaks and the development of better ways to anticipate these changes.
As the researchers state, this analysis is crucial as large language models like ChatGPT become integrated into systems at scale.