As the rapid evolution of large language models (LLM) continues, businesses are increasingly interested in “fine-tuning” these models for bespoke applications. This practice aims to reduce bias and unwanted responses while customizing models for specific applications. However, a recent study by Princeton University, Virginia Tech, and IBM Research highlights a concerning downside to this trend.
The researchers discovered that fine-tuning LLMs can inadvertently weaken the safety measures designed to prevent the models from generating harmful content. This vulnerability could potentially undermine the goals of fine-tuning the models in the first place. It is alarming to note that malicious actors can exploit this vulnerability with minimal effort during the fine-tuning process. Furthermore, even well-intentioned users could unintentionally compromise their own models while fine-tuning.
This discovery emphasizes the complex challenges faced by the enterprise LLM landscape, especially as a significant portion of the market shifts towards creating specialized and fine-tuned models for specific applications and organizations.
Safety Alignment and Challenges
Developers of LLMs invest significant effort in ensuring that their creations do not generate harmful outputs such as malware, illegal activity, or child abuse content. This ongoing process known as “safety alignment” is crucial. However, users and researchers continuously uncover new techniques and prompts that can trick the model into bypassing its safeguards.
Simultaneously, LLM providers encourage the fine-tuning of their models by enterprises for specific applications. The open-source Llama 2 models from Meta Platforms and the GPT-3.5 Turbo offered by OpenAI are examples of models that can be fine-tuned for enhanced performance and risk mitigation in particular use cases.
“Disconcertingly, in our experiments… we note safety degradation.” – Researchers from the study
Through their study, the researchers examined scenarios where the safety measures of LLMs could be compromised through fine-tuning. They tested both the open-source Llama 2 model and the closed-source GPT-3.5 Turbo, evaluating their fine-tuned models on safety benchmarks and an automated safety judgment method via GPT-4.
The researchers found that malicious actors could exploit the “few-shot learning” capability of LLMs, allowing them to fine-tune models for harmful purposes even with a minimal number of examples. This capability, while advantageous, can also be a weakness when misused.
“While [few-shot learning] serves as an advantage, it can also be a weakness when malicious actors exploit this capability to fine-tune models for harmful purposes.” – Researchers from the study
The study also revealed that fine-tuned models could generalize to other harmful behaviors not included in the training examples. This vulnerability opens a potential loophole for “data poisoning” attacks, where malicious actors add harmful examples to the dataset used for training or fine-tuning.
“Merely fine-tuning with some benign (and purely utility-oriented) datasets… could compromise LLMs’ safety alignment!” – Researchers from the study
Furthermore, even benign fine-tuning on utility-oriented datasets can compromise the safety alignment of LLMs. This degradation can occur due to “catastrophic forgetting” and the tension between helpfulness demanded by fine-tuning examples and the harmlessness required by safety alignment training.
The researchers emphasize the need to implement more robust alignment techniques during pre-training, enhance moderation measures for fine-tuning data, add safety alignment examples to the dataset, and establish safety auditing practices for fine-tuned models to maintain safety alignment during the fine-tuning process.
These findings highlight the importance of adding new safety measures to protect enterprise customers from potential harms associated with fine-tuned LLMs.