Massive AI training datasets, also known as corpora, play a crucial role in developing large language models (LLMs), providing the backbone for models like OpenAI’s GPT-4 and Meta’s Llama. In 2023, however, EleutherAI, the organization behind the Pile, one of the world’s largest open text datasets, faced scrutiny and litigation over the legal and ethical implications of such corpora. Despite the setbacks, EleutherAI is now collaborating with multiple organizations and researchers on an updated version of the Pile, which promises to be bigger, better, and more diverse.
Building a Better Dataset: The Updated Pile
The updated version of the Pile dataset is currently under development and is expected to be finalized in a few months. According to Stella Biderman, executive director of EleutherAI and a lead scientist and mathematician at Booz Allen Hamilton, the new training dataset will be even larger and substantially improved over its predecessor. Biderman says, “There’s going to be a lot of new data, including some that has not been seen anywhere before, which is going to be really exciting.”
The Pile v2 will include more recent data, better preprocessing, higher-quality sources, and more diverse coverage of non-academic nonfiction domains. Biderman notes that the original Pile, released in December 2020, consisted of 22 sub-datasets, including Books3, PubMed Central, arXiv, Stack Exchange, Wikipedia, YouTube subtitles, and even Enron emails. She emphasizes that the Pile remains the most thoroughly documented LLM training dataset in the world.
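That documentation extends to the distribution format itself, which makes the dataset’s composition easy to inspect programmatically. The following is a minimal sketch, assuming a locally downloaded shard in the Pile’s published format of zstd-compressed JSON Lines, where each line carries a "text" field and a "meta" field with the sub-dataset name under "pile_set_name"; the shard path is hypothetical.

```python
# Minimal sketch: tally documents per Pile sub-dataset from one shard.
# Assumes the Pile's published format (zstd-compressed JSON Lines with
# "text" and "meta"/"pile_set_name" fields); the path is hypothetical.
import io
import json
from collections import Counter

import zstandard  # pip install zstandard

counts = Counter()
with open("pile/train/00.jsonl.zst", "rb") as fh:  # hypothetical shard path
    # Decompress on the fly and iterate line by line, so the whole shard
    # never has to fit in memory.
    stream = zstandard.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        record = json.loads(line)
        counts[record["meta"]["pile_set_name"]] += 1

# Print the per-subset document counts, largest first.
for subset, n in counts.most_common():
    print(f"{subset}: {n}")
```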
The Objectives and Selectivity of the Pile
When EleutherAI developed the Pile, its goal was to create an extensive dataset of billions of text passages, comparable in scale to what OpenAI used to train GPT-3. Unlike approaches that scrape the web indiscriminately, EleutherAI adopted a more discerning methodology, carefully selecting topics and domains to give the model meaningful information about the world. Aviya Skowron, head of policy and ethics at EleutherAI, says the organization’s general position is that training models on copyrighted data is fair use. Even so, EleutherAI acknowledges the need to address copyright and data-licensing concerns, and doing so is a key focus of the Pile v2 project.
The composition of the new dataset reflects this effort. It includes public-domain data, text licensed under Creative Commons, code under open-source licenses, text with licenses that permit redistribution and reuse, and smaller datasets used with explicit permission from rights holders. EleutherAI aims to strike a balance between innovation and respect for copyright and licensing requirements.
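To make the idea concrete, the kind of license gating this composition implies can be sketched as a simple allowlist filter. The SPDX-style identifiers, the record shape, and the "license" metadata field below are illustrative assumptions, not EleutherAI’s actual pipeline.

```python
# Hypothetical license-allowlist filter illustrating the Pile v2
# composition described above. The identifiers and the "license"
# field are assumptions for illustration, not EleutherAI's pipeline.
PERMITTED_LICENSES = {
    "public-domain",   # public-domain data
    "CC0-1.0",         # Creative Commons dedications and licenses
    "CC-BY-4.0",
    "CC-BY-SA-4.0",
    "MIT",             # open-source code licenses
    "Apache-2.0",
}

def is_permitted(record: dict) -> bool:
    """Keep a document only if its declared license permits reuse."""
    return record.get("license") in PERMITTED_LICENSES

# Toy usage: only the permissively licensed document survives.
documents = [
    {"text": "A public-domain passage ...", "license": "CC0-1.0"},
    {"text": "An all-rights-reserved passage ...", "license": "proprietary"},
]
kept = [doc for doc in documents if is_permitted(doc)]
print(len(kept))  # -> 1
```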
Controversies and Challenges Surrounding AI Training Datasets
The impact of AI training datasets has been a topic of concern and debate for years. Researchers have traced racial bias in AI systems to large image datasets, legal battles have erupted over image training sets, and copyright controversies have surged since the release of OpenAI’s ChatGPT. More disturbing findings have also surfaced, including the use of large image corpora to create nonconsensual deepfake pornography and the discovery of child sexual abuse material in certain datasets.
Both Biderman and Skowron acknowledge the complex and nuanced nature of the debate around AI training data. They point to the difficulty of safely removing inappropriate content from a dataset once it is found, and of screening datasets for such content in advance. They also empathize with creative workers whose work has been used to train AI models: while some creators understandably feel upset and hurt, they argue, it is essential to consider both the licensing decisions made in the past and the evolving nature of AI technology.
The Importance of Visibility and Ethics in AI Model Training
Biderman and Skowron argue that AI models trained on open datasets like the Pile are safer to use because the visibility into their training data enables ethical and responsible deployment in a variety of contexts. They call for greater visibility into, documentation of, and access to datasets in service of policy objectives, ethical ideals, and research inquiries, and they see transparency about the training process as a crucial step toward responsible AI development.
Despite the challenges and criticisms, EleutherAI continues its work on the updated Pile dataset. The team is optimistic about training and releasing models this year, recognizing the potential for a small yet meaningful impact on the field of AI. The ongoing efforts of organizations like EleutherAI aim to push the boundaries of AI research while ensuring legal and ethical considerations are met.