As the debate heats up around the use of copyrighted works to train large language models (LLMs) such as OpenAI’s ChatGPT, Meta’s Llama 2, and Anthropic’s Claude 2, one obvious question arises: can these models even be altered or edited to remove their knowledge of such works, without totally retraining them or rearchitecting them?
In a new paper published on the open access and non-peer reviewed site arXiv.org, Microsoft Research’s Ronen Eldan and Microsoft Azure’s Mark Russinovich propose a novel approach to addressing this issue. They present a method for erasing specific information from a sample LLM, focusing on eliminating all knowledge of the existence of the Harry Potter books (including characters and plots) from Meta’s open-source Llama 2-7B.
“While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model’s ability to generate or recall Harry Potter-related content,” the authors write.
This groundbreaking work paves the way for more adaptable language models. The ability to refine AI over time to align with shifting organizational needs is crucial for long-term, enterprise-safe deployments.
Unlearning Specific Information in LLMs
Traditional machine learning approaches focus on adding or reinforcing knowledge through fine-tuning, but they provide no straightforward mechanism to “forget” or “unlearn” it. To overcome this limitation, Eldan and Russinovich developed a three-part technique to approximate unlearning specific information in LLMs.
- First, they trained a model on the target data (Harry Potter books) to identify tokens most related to it compared to a baseline model.
- Second, they replaced unique Harry Potter expressions with generic counterparts and generated alternative predictions to simulate a model without that training.
- Third, they fine-tuned the baseline model on these alternative predictions, effectively erasing the original text from its memory when prompted with the context.
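The second step, generating “generic” alternative predictions, can be pictured as combining next-token scores from the baseline model and the topic-reinforced model. The sketch below is a simplified illustration under assumptions, not the authors’ exact recipe: it assumes access to raw next-token logits from both models, and the offset formula and `alpha` strength parameter are illustrative choices.

```python
import numpy as np

def generic_logits(baseline, reinforced, alpha=1.0):
    """Penalize tokens that the topic-reinforced model boosts relative
    to the baseline; those boosts are presumed to be topic-specific
    (e.g., Harry Potter vocabulary). Neutral tokens are left untouched."""
    boost = np.maximum(reinforced - baseline, 0.0)  # topic-specific signal
    return baseline - alpha * boost

# Toy 4-token vocabulary: ["the", "wand", "Hermione", "dog"]
baseline   = np.array([2.0, 0.5, 0.1, 1.0])
reinforced = np.array([2.0, 3.0, 4.0, 1.0])  # boosts "wand" and "Hermione"

adjusted = generic_logits(baseline, reinforced, alpha=1.0)
# Topic tokens are suppressed; tokens the two models agree on are unchanged.
```

Fine-tuning the baseline model toward targets derived from such adjusted predictions is what would drive the “forgetting” in the third step.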
To evaluate the effectiveness of their technique, the authors tested the model’s ability to generate or discuss Harry Potter content using 300 automatically generated prompts and by inspecting token probabilities.
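Probability-based probing of this kind can be sketched simply: feed the model a prompt whose natural completion is topic-specific, and measure how much next-token probability mass lands on topic tokens before versus after unlearning. The probe prompt, toy vocabulary, and threshold below are illustrative assumptions, not the paper’s actual evaluation harness.

```python
import numpy as np

def topic_mass(logits, topic_ids):
    """Fraction of next-token probability the model assigns to
    topic-specific tokens at a probe prompt."""
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    return probs[topic_ids].sum()

# Toy next-token logits at a probe prompt; index 2 is a topic token.
topic_ids = [2]
before = np.array([0.1, 0.2, 5.0, 0.3])  # original model: topic token dominates
after  = np.array([0.1, 0.2, 0.4, 0.3])  # unlearned model: mass spread out

# A large drop in topic_mass(before) vs. topic_mass(after) would indicate
# the model no longer strongly predicts topic-specific completions.
```

In practice one would average such scores over many probe prompts (the paper uses 300 automatically generated ones) rather than a single toy distribution.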
“To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models,” the authors write.
They found that while the original model could easily discuss intricate Harry Potter plot details, after only an hour of fine-tuning with their technique, “it’s possible for the model to essentially ‘forget’ the intricate narratives of the Harry Potter series.” Performance on standard benchmarks such as ARC, BoolQ, and Winogrande “remains almost unaffected.”
Future Implications and Limitations
While this proof-of-concept is a significant step towards creating more responsible, adaptable, and legally compliant LLMs, further testing and refinement are necessary. The authors acknowledge that their evaluation approach has limitations and more research is needed to assess the applicability of their technique across various content types.
It is worth considering that their technique may be more effective for fictional texts than non-fiction, as fictional worlds tend to have more unique references. Nonetheless, this study offers a promising foundation for the development of more robust techniques for selective forgetting in LLMs.
As the authors conclude, further refinement could help address “ethical guidelines, societal values, or specific user requirements.” Moving forward, more advanced techniques for selective forgetting could help keep AI systems dynamically aligned with shifting business and societal priorities as needs evolve.